I get nervous when I hear "the raw data gets shoved up to quilt just fine" because it makes me think that a terabyte of image data is going to be uploaded to s3 many times a day while you wait. Can you reassure me?
I also didn't quite understand the parquet comment. I don't know how widely parquet is used, or what data you want to use it for. Are you talking about using it just as an archive/transmission format to and from quilt?
> I get nervous when I hear "the raw data gets shoved up to quilt just fine" because it makes me think that a terabyte of image data is going to be uploaded to s3 many times a day while you wait. Can you reassure me?
Sure. So there are two parts to this:

1. Pushes happen per step: `cvapipe {step} push` instead of `cvapipe all push`.
2. It will be pretty obvious to the user if someone tries to push something large up to quilt, as they will get a big old progress bar that says "Hashing ... X GB" or "Hashing ... X TB". Personally, if I saw those messages and I wasn't on master or it wasn't from CI, I would keyboard interrupt.

> I also didn't quite understand the parquet comment. I don't know how widely parquet is used, or what data you want to use it for. Are you talking about using it just as an archive/transmission format to and from quilt?
`parquet` is definitely growing in popularity as datasets become larger (faster IO than `csv`), and, similar to us, data type metadata is pretty nice for a lot of datasets. Your last sentence is basically correct. Currently `datastep` says "a manifest produced by a step must be a csv" (`local_staging/raw/manifest.csv`). I would like to allow "a manifest produced by a step can be a csv or parquet file", and then all the steps in this repo would serialize and deserialize using `parquet` instead of `csv` (`local_staging/raw/manifest.parquet`).
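For what that could look like, here is a minimal sketch (this is not the actual `datastep` API, just an illustration of the idea) of reading and writing a manifest as either csv or parquet based on the file extension:

```python
from pathlib import Path

import pandas as pd


def read_manifest(path: Path) -> pd.DataFrame:
    """Load a step manifest, picking the reader by file extension."""
    if path.suffix == ".parquet":
        return pd.read_parquet(path)
    return pd.read_csv(path)


def write_manifest(manifest: pd.DataFrame, path: Path) -> None:
    """Serialize a step manifest, picking the writer by file extension."""
    if path.suffix == ".parquet":
        manifest.to_parquet(path, index=False)
    else:
        manifest.to_csv(path, index=False)


# e.g. write_manifest(df, Path("local_staging/raw/manifest.parquet"))
```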
Related: datastep #60
This lays the groundwork for how I have been thinking of handling the raw data for the Cell Variance Analysis program.
`Raw` is a step that simply validates that the dataset provided has some bare minimum set of fields that are required by downstream processing. Notice that I went with Jianxu's suggestion of a `GoodCellIndicies` column.
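As a rough sketch of what that validation amounts to (the required column names here are placeholders, not the step's actual schema):

```python
import pandas as pd

# Placeholder set of required fields; the real list lives in the Raw step itself
REQUIRED_COLUMNS = {"FOVId", "SourceReadPath", "GoodCellIndicies"}


def validate_dataset(dataset: pd.DataFrame) -> pd.DataFrame:
    """Fail fast if the provided dataset is missing fields needed downstream."""
    missing = REQUIRED_COLUMNS - set(dataset.columns)
    if missing:
        raise ValueError(f"Dataset is missing required columns: {sorted(missing)}")
    return dataset
```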
Currently, because `datastep` assumes the manifest is a `csv`, the reading and writing of this list of ids is pretty bad. I am recommending that for the rest of this project we switch to reading and writing `parquet` files (`pd.read_parquet`), as `parquet` is not only faster to read than `csv` but also stores data type information in its headers, which solves the `List[int]` problem created by `GoodCellIndicies`. (To do this I would need to release a new version of `datastep`, but it wouldn't take long.)
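To make the `List[int]` problem concrete, here is a small illustration (toy values, not the real data) of a list-valued column surviving a `parquet` round trip but coming back as strings from `csv`:

```python
import pandas as pd

# Toy manifest with a list-valued column like GoodCellIndicies (values made up)
manifest = pd.DataFrame({
    "FOVId": [1, 2],
    "GoodCellIndicies": [[0, 3, 7], [2, 5]],
})

# CSV round trip: each list comes back as a plain string such as "[0, 3, 7]"
manifest.to_csv("manifest.csv", index=False)
print(type(pd.read_csv("manifest.csv")["GoodCellIndicies"][0]))  # <class 'str'>

# Parquet round trip: type information is kept, values come back array-like
manifest.to_parquet("manifest.parquet", index=False)
print(type(pd.read_parquet("manifest.parquet")["GoodCellIndicies"][0]))
```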
As with all `datastep` workflows, you can see the data produced by this PR here. (In this case you don't really need to see anything other than that :tada: there is some sample of the raw data :tada:) I generated this dataset using the following:
TL;DR: `steps/raw.py` / the `Raw` step is just a pass-through from some dataset file. The raw data gets shoved up to quilt just fine (please read `DEV_RECOMMENDATIONS` about storage). Before this is merged, I would like everyone's approval to switch to `parquet` as the dataset serialization format so that data types are protected. The dataset produced by this `Raw` step is basically: "every row is an FOV + segmentations + metadata + a `GoodCellIndicies` column". There is already some good documentation in the repo under the README.