Closed yymao closed 5 years ago
Thanks @yymao for this, I fully agree with your proposal. I would add a few additional points to the requirements:
@rmandelb Hi there, this issue is actually important for the analysis working groups since they will be interacting with the DC2 data. How can we get some attention to this issue? Thanks a lot!
@katrinheitmann There is attention to this issue within the DATF. What specifically do you want to know?
I've clarified the title to emphasize that this issue is meant to talk about how we represent the data products the DATF is providing access to.
@yymao This seems to be really catching on. Two quick questions: what about compression levels, and is there any hope of parallel writing (or, equivalently, a library that offers locking)? Or is the paradigm that partitions should be small and written serially?
@rbiswas4
(I get the sense that this discussion is going beyond the smaller original motivation of standardizing within DATF, but that's totally fine.)
@wmwv
Ah yes, sorry, I might be going beyond the original context, but:
So it seems like Parquet would not force people to give things up, and might be a good format to try out even outside of the DATF.
- I began to think about I/O for light curves and analysis results.
It's much less clear to me that Parquet is the right file format for non-block writes. It's predicated on having a known schema. You can use it in strict-append mode, and people do, but that's not particularly where it shines. It's a column-based store, so it's not designed to optimize row-based appends.
- compression, as is possible in HDF5/FITS
I don't understand what you mean; can you be specific? Are you talking about using a binary file format instead of CSV, gzipping files, RICE compression of image blocks, or using 16-bit floats instead of 64-bit floats? Are you concerned about file storage space or about performance?
For a bit of an update: this issue was discussed at the March 18 DATF meeting, and two things came out of it:
@wmwv Sorry, getting back to your second-to-last comment:
I see, thanks! OK, but is there something else you would recommend over Parquet currently? Or is it that you are not entirely happy with Parquet, but it is no worse than anything else people are using?
I was thinking of binary formats instead of CSV-like ones, and of using fewer bits for many quantities, motivated by both space and performance.
Some additional context: the project is also using parquet for replicating qserv data on disk (although parquet is not the qserv native format), which has the added benefit of enabling some powerful DASK workflows in the LSST Science Platform.
@yymao I think the answer to the question in the header is "yes". If you agree, could you write a short conclusion and close the issue? Thanks!
After the discussions here, within the DATF, and with some analysis teams, we have reached the conclusion that the DATF will use parquet as its internal, on-disk storage format. Multiple reasons have been documented above; among them, it's worth noting that the Project will also be using parquet for their Data Release Products.
At the time I am writing this, the DATF has switched to parquet for most of our Data Release Data Products (except for the object catalogs, see #342). For other existing data products (e.g., cosmoDC2), no conversion will be made. For new data products that are not Data Release Data Products (e.g., add-on catalogs), we will ask (but not yet strictly enforce) that catalog creators use the parquet format.
When GCR was originally developed, the philosophy was that it would be difficult to convince people to adopt the same on-disk storage format, but that we would still want a common user interface to access different datasets stored in different formats.
Fast forward to now: despite the convenience of GCR, supporting multiple on-disk storage formats creates significant overhead (e.g., a new reader must be written for each dataset), so it seems to be a good time to revisit the question of finding a common on-disk storage format, at least to serve as a default format, if not an exclusive one.
What are the requirements for an on-disk storage format that would fit our needs? From our experience:
I think Parquet meets all the requirements listed above (note: Parquet does not actually support multidimensional arrays, but it does support nested arrays). So the question is what other considerations we should weigh before adopting Parquet as our common on-disk storage format.