data file structure - GitHub issues

7yl4r commented 7 years ago

hmm... how to organize all this atomic data?

Perhaps into steps?

./data
    /A-ingest
        /btc
        /trends
        /...other {data_source}s
    /B-preprocess
        /B-1-sample
            /{data_source}
        /B-2-interpolate
            /{data_source}
    /C-model
        /{data_source}
            /{model_type}
    /D-model-evaluation
    /E-forecast
    /F-actions
    /G-action-evaluation

7yl4r commented 7 years ago

does it make sense to arrange modules like this too? I don't like the idea of using the A/B/C in the package names to order them... but the only alternative I can think of is not ordering them.

7yl4r commented 7 years ago

or perhaps top-level packages should be organized by data_source, with a common structure based on the steps above?

./CryptoForecast/
    /trends/
        /ingest
        /preprocess
        /model
        /model-evaluation
        /forecast
    /btc
        /...(same as above)
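
A minimal Python sketch of how code could resolve paths under this data_source-first layout. The names `STEPS` and `data_path`, and the `data` root directory, are assumptions for illustration and are not taken from the repository.

```python
from pathlib import Path

# Step names taken from the tree above, listed in pipeline order.
STEPS = ("ingest", "preprocess", "model", "model-evaluation", "forecast")


def data_path(data_source, step, root="data"):
    """Return root/data_source/step, rejecting steps that are not in STEPS."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step!r}")
    return Path(root) / data_source / step


print(data_path("trends", "ingest"))
print(data_path("btc", "forecast"))
```

Keeping the steps in an ordered tuple preserves the pipeline order without putting A/B/C prefixes in the package or directory names.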

7yl4r commented 7 years ago

yes, I like that better for both code and data organization. There is one caveat, however: generic and/or abstract classes will also be top-level in their own package, and cross-data_source models may get a bit funky as well.
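
To illustrate the caveat, here is a rough sketch of what a shared abstract class in its own top-level package might look like. `DataSourcePipeline` and `TrendsPipeline` are hypothetical names, only two of the steps are shown, and the method bodies are placeholders rather than real ingest or preprocessing logic.

```python
from abc import ABC, abstractmethod


class DataSourcePipeline(ABC):
    """Hypothetical shared base class that would live in its own top-level package."""

    name = None  # set by each data_source package, e.g. "trends" or "btc"

    @abstractmethod
    def ingest(self):
        """Fetch raw data for this data_source."""

    @abstractmethod
    def preprocess(self, raw):
        """Turn ingested data into model-ready data."""

    def run(self):
        # Steps run in pipeline order without encoding the order in package names.
        return self.preprocess(self.ingest())


class TrendsPipeline(DataSourcePipeline):
    """Hypothetical subclass that would live in the trends package."""

    name = "trends"

    def ingest(self):
        return [3, 1, 2]  # placeholder values, not real data

    def preprocess(self, raw):
        return sorted(raw)  # stand-in for real preprocessing


print(TrendsPipeline().run())
```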

7yl4r commented 7 years ago

So let's think about versioning atomic data with this new paradigm. Options:

1. top-level versioning implemented with a custom script to move data when I feel like it:

/data
    /trends/
        /ingest
        /preprocess
        /model
        /model-evaluation
        /forecast
    /btc
        /...(same as above)
/data_1/
    (same as above)
/data_2/
    (same as above)
# data further back than this gets deleted
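
A possible shape for that custom script, as a sketch only. The function name `rotate_data` and the `keep` parameter are assumptions; the `data`, `data_1`, `data_2` directory names follow the tree above.

```python
import shutil
from pathlib import Path


def rotate_data(base, keep=2):
    """Shift data -> data_1 -> ... -> data_{keep} under base, deleting anything older."""
    base = Path(base)
    oldest = base / f"data_{keep}"
    if oldest.exists():
        shutil.rmtree(oldest)  # data further back than `keep` versions gets deleted
    # Rename from oldest to newest so no existing directory is overwritten.
    for n in range(keep - 1, 0, -1):
        src = base / f"data_{n}"
        if src.exists():
            src.rename(base / f"data_{n + 1}")
    current = base / "data"
    if current.exists():
        current.rename(base / "data_1")
    current.mkdir()  # start a fresh, empty current version
```

Calling `rotate_data(".")` from the project root would archive the current `data` directory as `data_1` and leave an empty `data` in its place.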

2. file-level versioning implemented within the classes themselves:

/data
    /trends/
        /ingest
            /ingestFileA_1
            /ingestFileA_2
            /ingestFileB_1
            /ingestFileB_2
        /preprocess
        /model
        /model-evaluation
        /forecast
    /btc
        /...(same as above)
/data_1/
    (same as above)
/data_2/
    (same as above)
# data further back than this gets deleted
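
For the file-level variant, each class would need some way to pick the next version suffix. A minimal sketch, assuming the `{name}_{N}` naming shown in the tree above; the helper name `next_version_path` is hypothetical.

```python
from pathlib import Path


def next_version_path(directory, stem):
    """Return directory/stem_N, where N is one above the highest existing version."""
    directory = Path(directory)
    versions = []
    for p in directory.glob(f"{stem}_*"):
        suffix = p.name[len(stem) + 1:]
        if suffix.isdigit():  # ignore files that merely share the prefix
            versions.append(int(suffix))
    return directory / f"{stem}_{max(versions, default=0) + 1}"
```

With `ingestFileA_1` and `ingestFileA_2` already present, `next_version_path("data/trends/ingest", "ingestFileA")` would point at `ingestFileA_3`.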

Option 2 (file-level versioning) allows for easier comparison between versions, but the versions are harder to manage manually, so I'm leaning toward option 1 (top-level versioning).

7yl4r / crypto-forcast

data file structure #1