7yl4r / crypto-forcast

:moneybag: ingest, model, forecast, :moneybag:! ...but also I'm just learning Luigi

data file structure #1

Open 7yl4r opened 7 years ago

7yl4r commented 7 years ago

hmm... how to organize all this atomic data?

Perhaps into steps?

./data
    /A-ingest
        /btc
        /trends
        /...other {data_source}s
    /B-preprocess
        /B-1-sample
            /{data_source}
        /B-2-interpolate
            /{data_source}
    /C-model
        /{data_source}
            /{model_type}
    /D-model-evaluation
    /E-forecast
    /F-actions
    /G-action-evaluation
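
To make the step-first layout concrete, here is a minimal sketch that creates the first few step directories for each data source. The names `STEPS`, `DATA_SOURCES`, `step_dir`, and `make_skeleton` are made up for illustration and are not part of the repo; only `btc` and `trends` are named above, and the later steps (D through G) are left out because their sub-structure is not specified.

```python
from pathlib import Path

# Step directories taken from the tree above; the letter prefixes keep them sorted.
STEPS = [
    "A-ingest",
    "B-preprocess/B-1-sample",
    "B-preprocess/B-2-interpolate",
    "C-model",
]
# Assumption: only the two data sources named in the tree are listed here.
DATA_SOURCES = ["btc", "trends"]


def step_dir(root, step, data_source):
    """Return the directory for one data_source within one step."""
    return Path(root) / step / data_source


def make_skeleton(root):
    """Create every step/data_source directory under root and return the paths."""
    created = []
    for step in STEPS:
        for source in DATA_SOURCES:
            path = step_dir(root, step, source)
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_skeleton("./data")` would create `./data/A-ingest/btc`, `./data/A-ingest/trends`, and so on.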
7yl4r commented 7 years ago

does it make sense to arrange modules like this too? I don't like the idea of using the A/B/C in the package names to order them... but the only alternative I can think of is not ordering them.
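
One possible middle ground, sketched here as an assumption rather than anything the repo does: keep the package names prefix-free and record the ordering once, in code. The `Step` enum and `ordered_dir_names` helper below are hypothetical names.

```python
from enum import IntEnum


class Step(IntEnum):
    """Pipeline steps; the integer value carries the ordering."""

    INGEST = 1
    PREPROCESS = 2
    MODEL = 3
    MODEL_EVALUATION = 4
    FORECAST = 5

    @property
    def dir_name(self):
        """Directory name without any ordering prefix, e.g. 'model-evaluation'."""
        return self.name.lower().replace("_", "-")


def ordered_dir_names():
    """Return the prefix-free directory names in pipeline order."""
    return [step.dir_name for step in sorted(Step)]
```

With this, the order lives in one place (`Step`) and the names on disk stay readable.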

7yl4r commented 7 years ago

or perhaps top-level packages should be organized by data_source, with a common structure based on the steps above?

./CryptoForecast/
    /trends/
        /ingest
        /preprocess
        /model
        /model-evaluation
        /forecast
    /btc
        /...(same as above)
7yl4r commented 7 years ago

yes, I like that better for both code and data organization. One caveat, however: generic and/or abstract classes will also need their own top-level package, and cross-data_source models may get a bit funky as well.
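
A rough sketch of what that caveat could look like in code, assuming a shared base class in its own top-level package and one subclass per data_source. `IngestBase`, `BtcIngest`, and `output_dir` are hypothetical names, and plain Python classes stand in for whatever task classes the project actually uses.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class IngestBase(ABC):
    """Hypothetical shared base class that would live in a common top-level package."""

    data_source = None  # subclasses set this, e.g. "btc" or "trends"
    step = "ingest"

    def __init__(self, root="data"):
        self.root = Path(root)

    def output_dir(self):
        """Directory for this data_source and step in the source-first layout."""
        return self.root / self.data_source / self.step

    @abstractmethod
    def fetch(self):
        """Return the raw records for this data source."""


class BtcIngest(IngestBase):
    """Would live under the btc package and only fill in the source-specific parts."""

    data_source = "btc"

    def fetch(self):
        # Placeholder: a real implementation would pull price data here.
        return []
```

A cross-data_source model has no single `data_source` to set, which is where this structure starts to strain.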

7yl4r commented 7 years ago

So let's think about versioning atomic data with this new paradigm. Options:

  1. top-level versioning implemented with a custom script to move data when I feel like it:

    /data
        /trends/
            /ingest
            /preprocess
            /model
            /model-evaluation
            /forecast
        /btc
            /...(same as above)
    /data_1/
        (same as above)
    /data_2/
        (same as above)
    # data further back than this gets deleted
  2. file-level versioning implemented within the classes themselves:

    /data
        /trends/
            /ingest
                /ingestFileA_1
                /ingestFileA_2
                /ingestFileB_1
                /ingestFileB_2
            /preprocess
            /model
            /model-evaluation
            /forecast
        /btc
            /...(same as above)
    /data_1/
        (same as above)
    /data_2/
        (same as above)
    # data further back than this gets deleted
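
For option 2, each class would need some way to pick the next version suffix. A minimal sketch, assuming the `{name}_{N}` naming shown in the tree; `next_version_path` is a hypothetical helper, not existing code.

```python
import re
from pathlib import Path


def next_version_path(directory, stem):
    """Return directory/stem_N, where N is one more than the highest existing version."""
    pattern = re.compile(re.escape(stem) + r"_(\d+)")
    versions = []
    for entry in Path(directory).iterdir():
        match = pattern.fullmatch(entry.name)
        if match:
            versions.append(int(match.group(1)))
    # With no existing versions, numbering starts at 1.
    return Path(directory) / f"{stem}_{max(versions, default=0) + 1}"
```

Every writer has to scan the directory like this (or track versions some other way), which is part of what makes this option harder to manage by hand.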

Option 2 allows for easier comparison between versions, but the versions are harder to manage manually, so I'm leaning toward option 1.
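
The custom script for option 1 could be as small as the sketch below. It assumes the `data`, `data_1`, `data_2` directories sit side by side under one root and that keeping two old copies is the intent; `rotate_data` and its `keep` parameter are made-up names.

```python
import shutil
from pathlib import Path


def rotate_data(root, keep=2):
    """Shift data -> data_1 -> ... -> data_<keep>, dropping the oldest copy."""
    root = Path(root)

    # Data further back than data_<keep> gets deleted.
    oldest = root / f"data_{keep}"
    if oldest.exists():
        shutil.rmtree(oldest)

    # Shift the older copies up by one, starting from the oldest remaining.
    for n in range(keep - 1, 0, -1):
        src = root / f"data_{n}"
        if src.exists():
            src.rename(root / f"data_{n + 1}")

    # The current data becomes data_1, and a fresh empty data directory is created.
    current = root / "data"
    if current.exists():
        current.rename(root / "data_1")
    current.mkdir()
```

Running `rotate_data(".")` whenever a snapshot is wanted keeps the three-directory structure above without touching the classes themselves.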