I get nervous when I hear "the raw data gets shoved up to quilt just fine" because it makes me think that a terabyte of image data is going to be uploaded to s3 many times a day while you wait. Can you reassure me?
I also didn't quite understand the parquet comment. I don't know how widely parquet is used, or what data you want to use it for. Are you talking about using it just as an archive/transmission format to and from quilt?
> I get nervous when I hear "the raw data gets shoved up to quilt just fine" because it makes me think that a terabyte of image data is going to be uploaded to s3 many times a day while you wait. Can you reassure me?
Sure. So there are two parts to this:

1. Pushes happen per step: `cvapipe {step} push` instead of `cvapipe all push`.
2. It will be pretty obvious to the user if someone tries to push something large up to quilt, as they will get a big old progress bar that says "Hashing ... X GB" or "Hashing ... X TB". Personally, if I saw those messages and I wasn't on master or it wasn't from CI, I would keyboard interrupt.

> I also didn't quite understand the parquet comment. I don't know how widely parquet is used, or what data you want to use it for. Are you talking about using it just as an archive/transmission format to and from quilt?
`parquet` is definitely growing in popularity as datasets become larger (faster IO than `csv`), and, similar to us, data type metadata is pretty nice for a lot of datasets. Your last sentence is basically correct. Currently `datastep` says "a manifest produced by a step must be a csv" (`local_staging/raw/manifest.csv`). I would like to allow "a manifest produced by a step can be a csv or parquet file", and then all the steps in this repo would serialize and deserialize using `parquet` instead of `csv` (`local_staging/raw/manifest.parquet`).
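For what that could look like, here is a minimal sketch (this is not the actual `datastep` API, just an illustration of the idea) of reading and writing a manifest as either csv or parquet based on the file extension:

```python
from pathlib import Path

import pandas as pd


def read_manifest(path: Path) -> pd.DataFrame:
    """Load a step manifest, picking the reader by file extension."""
    if path.suffix == ".parquet":
        return pd.read_parquet(path)
    return pd.read_csv(path)


def write_manifest(manifest: pd.DataFrame, path: Path) -> None:
    """Serialize a step manifest, picking the writer by file extension."""
    if path.suffix == ".parquet":
        manifest.to_parquet(path, index=False)
    else:
        manifest.to_csv(path, index=False)


# e.g. write_manifest(df, Path("local_staging/raw/manifest.parquet"))
```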
Related: datastep #60
This lays the groundwork for how I have been thinking of handling the raw data for the Cell Variance Analysis program.
`Raw` is a step that simply validates that the dataset provided has some bare minimum set of fields that are required by downstream processing. Notice that I went with Jianxu's suggestion of a `GoodCellIndicies` column.
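As a rough sketch of what that validation amounts to (the required column names here are placeholders, not the step's actual schema):

```python
import pandas as pd

# Placeholder set of required fields; the real list lives in the Raw step itself
REQUIRED_COLUMNS = {"FOVId", "SourceReadPath", "GoodCellIndicies"}


def validate_dataset(dataset: pd.DataFrame) -> pd.DataFrame:
    """Fail fast if the provided dataset is missing fields needed downstream."""
    missing = REQUIRED_COLUMNS - set(dataset.columns)
    if missing:
        raise ValueError(f"Dataset is missing required columns: {sorted(missing)}")
    return dataset
```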
Currently, because `datastep` assumes the manifest is a `csv`, the reading and writing of this list of ids is pretty bad. I am recommending that for the rest of this project we switch to reading and writing `parquet` files (`pd.read_parquet`), as `parquet` is not only faster to read than `csv` but also stores data type information in its headers, which solves the `List[int]` problem created by `GoodCellIndicies`. (To do this I would need to release a new version of `datastep`, but it wouldn't take long.)
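To make the `List[int]` problem concrete, here is a small illustration (toy values, not the real data) of a list-valued column surviving a `parquet` round trip but coming back as strings from `csv`:

```python
import pandas as pd

# Toy manifest with a list-valued column like GoodCellIndicies (values made up)
manifest = pd.DataFrame({
    "FOVId": [1, 2],
    "GoodCellIndicies": [[0, 3, 7], [2, 5]],
})

# CSV round trip: each list comes back as a plain string such as "[0, 3, 7]"
manifest.to_csv("manifest.csv", index=False)
print(type(pd.read_csv("manifest.csv")["GoodCellIndicies"][0]))  # <class 'str'>

# Parquet round trip: type information is kept, values come back array-like
manifest.to_parquet("manifest.parquet", index=False)
print(type(pd.read_parquet("manifest.parquet")["GoodCellIndicies"][0]))
```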
As with all `datastep` workflows, you can see the data produced by this PR here. (In this case you don't really need to see anything other than that :tada: there is some sample of the raw data :tada:) I generated this dataset using the following:
TL;DR: `steps/raw.py` / the `Raw` step is just a pass-through from some dataset file. The raw data gets shoved up to quilt just fine (please read `DEV_RECOMMENDATIONS` about storage). Before this is merged, I would like everyone's approval to switch to `parquet` as the dataset serialization format so that data types are protected. The dataset produced by this `Raw` step is basically: "every row is an FOV + segmentations + metadata + a `GoodCellIndicies` column". There is already some good documentation in the repo under the README.