Closed yymao closed 5 years ago
Thanks @yymao for this, I fully agree with your proposal. I would add a few additional points to the requirements:
@rmandelb Hi there, this issue is actually important for the analysis working groups since they will be interacting with the DC2 data. How can we get some attention to this issue? Thanks a lot!
@katrinheitmann There is attention to this issue within the DATF. What specifically do you want to know?
I've clarified the title to emphasize that this issue is meant to talk about how we represent the data products the DATF is providing access to.
@yymao This seems to be really catching on. Two quick questions: what about compression levels, and is there any hope of parallel writing (or, equivalently, a library that offers locking)? Or is the paradigm that partitions should be small and written serially?
@rbiswas4
(I get the sense that this discussion is going beyond the smaller original motivation of standardizing within DATF, but that's totally fine.)
@wmwv
Ah yes, sorry, I might be going beyond the original context, but:
So it seems like Parquet would not force people to give things up, and might be a good format to try out even outside of the DATF.
- I began to think about I/O for light curves and analysis results.
It's much less clear to me that Parquet is the right file format for non-block writes. It's predicated on having a known schema. You can use it in strict-append mode, and people do, but that's not particularly where it shines. It's a column-based store, so it's not designed to optimize row-based appends.
- compression, as is possible in HDF5/FITS
I don't understand what you mean; can you be specific? Are you talking about using a binary file format instead of CSV, gzipping files, RICE compression of image blocks, or using 16-bit floats instead of 64-bit floats? Are you concerned about file storage space or about performance?
For a bit of an update: this issue was discussed at the March 18 DATF meeting, and two things came out of it:
@wmwv Sorry, getting back to your second-to-last comment:
I see, thanks! OK, but is there something else you would recommend over Parquet currently? Or is it that you are not entirely happy with Parquet, but it is no worse than anything else people are using?
I was thinking of binary formats instead of CSV-like ones, and of using fewer bits for many quantities, motivated by both space and performance.
Some additional context: the project is also using parquet for replicating qserv data on disk (although parquet is not the qserv native format), which has the added benefit of enabling some powerful DASK workflows in the LSST Science Platform.
@yymao I think the answer to the question in the header is "yes". If you agree, could you write a short conclusion and close the issue? Thanks!
After the discussions here, within the DATF, and with some analysis teams, we have reached the conclusion that the DATF will use parquet as its internal, on-disk storage format. Multiple reasons have been documented above; among them, it's worth noting that the Project will also be using parquet for their Data Release Products.
At the time I am writing this, the DATF has switched to parquet for most of our Data Release Data Products (except for the object catalogs, see #342). For other existing data products (e.g., cosmoDC2), no conversion will be made. For new data products that are not Data Release Data Products (e.g., add-on catalogs), we will ask (but not yet strictly enforce) that catalog creators use the parquet format.
When GCR was originally developed, the philosophy was that it would be difficult to convince people to adopt the same on-disk storage format, but that we would still want a common user interface to access different datasets stored in different formats.
Fast forward to now: despite the convenience of GCR, supporting multiple on-disk storage formats creates significant overhead (e.g., a new reader must be written for each dataset), so it seems to be a good time to revisit the question of finding a common on-disk storage format, at least to serve as a default format, if not an exclusive one.
What are the requirements for an on-disk storage format that would fit our needs? From our experience:
I think Parquet meets all the requirements listed above (note: Parquet does not actually support multidimensional arrays, but it does support nested arrays). So the question is what other considerations we should weigh before adopting Parquet as our common on-disk storage format.