codeforgoodconf / black_holes

This project was created at CodeForGood 2017

Put data somewhere else #1

Closed sauln closed 7 years ago

sauln commented 7 years ago

From black_holes_backend, created by sauln: codeforgoodconf/black_holes_backend#7

Currently, all the data is stored directly in the git repo. As we get more data to train on, the repo will become too big. The data should be stored somewhere it can be retrieved from with a single wget command.
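For scripted setups, the same retrieval could also be done from Python. The following is only a rough sketch, assuming the data is published as a single tarball at some public URL; the function name `fetch_and_extract`, the default archive name, and the example URL are hypothetical and not part of the project.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, dest_dir, archive_name="data.tar.gz"):
    """Download a tarball from `url` and unpack it under `dest_dir`.

    `archive_name` is a hypothetical default, not a name used by the project.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / archive_name

    # Skip the download when the archive is already present locally.
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))

    # Only unpack archives from a source you trust.
    with tarfile.open(str(archive_path)) as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support safer extraction filters.
            archive.extractall(str(dest), filter="data")
        else:
            archive.extractall(str(dest))
    return dest
```

Usage would look something like `fetch_and_extract("https://example.org/black_holes_data.tar.gz", "data")`, where the URL is a placeholder for wherever the data ends up being hosted.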

sauln commented 7 years ago

sauln:

How expensive are the various preprocessing steps? Would it be better to store a processed version of the data instead of the original data?

frankamp:

There are also conflicts within the data we have now (FITS files that appear in neg/pos and/or unknown-w-HE2). My recommendation: start by deleting it all from both repos, then request all new FITS files in the three categories.
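Before deleting anything, the conflicts could be listed with a short script. This is a minimal sketch, assuming one directory per category under a common data root, files ending in `.fits`, and that two files count as the same observation when their filenames match; the function name `find_conflicts` and that layout are assumptions, not the repo's actual structure.

```python
from collections import defaultdict
from pathlib import Path


def find_conflicts(data_root, categories=("neg", "pos", "unknown-w-HE2")):
    """Return {filename: [categories]} for files found in more than one category.

    Assumes a hypothetical layout of <data_root>/<category>/<name>.fits.
    """
    seen = defaultdict(set)
    for category in categories:
        # Record every category directory in which a given filename appears.
        for path in (Path(data_root) / category).glob("*.fits"):
            seen[path.name].add(category)
    # Keep only filenames that show up under two or more categories.
    return {name: sorted(cats) for name, cats in seen.items() if len(cats) > 1}
```

Matching on filename alone is the simplest choice; comparing file hashes instead would also catch renamed duplicates.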

The preprocessing we do for ML isn't appropriate for visualization, so it would be a tertiary intermediate format. I don't think there is any point in transforming the data for storage savings.

There is also the argument that an ML engine could find another supporting set of wavelengths, or a mitigating quality factor in the other flags that appear in the FITS format.

Finally, GM or whoever is doing similar work should already have all these FITS files on disk and, if not, should find a group to share with as an unmodifiable source set.

Barring that, Amazon offers free open data publishing on S3; convince them of the big data potential.

sauln:

So all the data we have now is junk?

Sounds like getting S3 space would require a written proposal? Sean already has a few pending, so maybe that will be made available to us in the future.

In Slack, Matthew (@simian201) mentioned Drive or Dropbox as an immediate solution to get the data out of the repo.

seanmarcia commented 7 years ago

Is there a sense of how much storage is needed, in both the short and long term?

sauln commented 7 years ago

Probably a couple of gigs for algorithm development. Production data could easily be 100x that, though, so on the order of a couple hundred GB.

sauln commented 7 years ago

Data can be found on my Drive, shared publicly.

The file is almost 1 GB and, when untarred, takes up about 1.5 GB.