bigbio / quantms.io

The proteomics quantification format, extending mzTab for large scale datasets.

[Discussion] Sharding parquet files #45

Closed jspaezp closed 2 weeks ago

jspaezp commented 4 months ago

Motivation: One of the issues with parquet is that every column has to be read as a whole (unlike a csv, where you can seek to an offset and read lines individually) ... so if, let's say, I wanted to read all the values for one protein, I would need to read the whole matrix no matter what. On the other hand, if the data is sharded at the protein level, I only need to read the shard that contains that specific protein. This does not matter most of the time, but it would enable a lot of server-less workflows (where resources are very limited in terms of compute and RAM, but not storage).

Suggested implementation: To support sharded parquet files, I am thinking of this variant of the configuration, where plural keys are allowed, taking a list of files, with the sharding information carried in a positional field of each file name.

(I know json does not allow comments, but bear with me here :P)

"quantms_files": [
     ....
     ## Instead of this one
     # {"absolute_file":     "PXD004683-958e8400-e29b-41f4-a716-446655440000.absolute.tsv"},
     ## This one
     {"absolute_files": [
              "PXD004683-958e8400-e29b-41f4-a716-446655440000.01.absolute.tsv",
              "PXD004683-958e8400-e29b-41f4-a716-446655440000.02.absolute.tsv"
          ]
     },
     ...
]
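As a reader-side sketch of the proposal above, a parser could accept both the singular and the plural form of each key. This is a minimal stdlib-only illustration; the function name and the key-suffix convention (`_file` vs `_files`) are assumptions, not part of the format.

```python
import json

def collect_files(entry):
    """Collect file paths from one quantms_files entry, accepting both the
    singular ("xxx_file": str) and the proposed plural ("xxx_files": [str])
    forms. Hypothetical helper; key names are illustrative."""
    paths = []
    for key, value in entry.items():
        if key.endswith("_files") and isinstance(value, list):
            paths.extend(value)   # sharded: a list of shard paths
        elif key.endswith("_file"):
            paths.append(value)   # unsharded: a single path
    return paths

config = json.loads("""
{"quantms_files": [
    {"absolute_files": ["PXD.01.absolute.parquet", "PXD.02.absolute.parquet"]},
    {"psm_file": "PXD.psm.parquet"}
]}
""")

all_paths = [p for e in config["quantms_files"] for p in collect_files(e)]
print(all_paths)
```

Existing consumers that only understand the singular key would keep working, since the plural form is additive.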

LMK what you think!

ypriverol commented 4 months ago

@jspaezp We did think about sharding for the psms tables, and I think it is a great idea to allow multiple files for a given category in the json files.

Another interesting thing to explore is whether the parquet files are self-describing, i.e. whether a reader can tell which column was used for the sharding, or whether we need to include something about the sharding in the json for the readers, for example the column or columns that were sharded on.

jpfeuffer commented 4 months ago

You could also have a look at hive-style partitioned parquet.

But in general, maybe it is enough to adapt how row groups are formed and to tune row group sizes a bit better when writing the parquet. AFAIK you shouldn't have to read all data rows with parquet (especially with libraries that support lazy data frames) even if it is unpartitioned.

Maybe a mix would be optimal. Partition by input file origin or something like that and form row groups per protein.

jspaezp commented 4 months ago

Did a bit more reading on the format (https://www.influxdata.com/blog/querying-parquet-millisecond-latency/) and I think my experience was with files that did not have internal grouping and that needed all columns for a filter operation (so each column had to be read as a whole for every "query"). So you are right: you would not read the whole file at once, you would read a whole column 'at once' (in theory ... I think libs like duckdb are better at preserving memory, but I would expect the operation to be analogous).

Regarding the possibility of explicit sharding ... there could be some convention if the column is made explicit in the format (although I find it hard to enforce ...). Furthermore, if the sharding is useful at read time, libraries like polars or duckdb will use it automatically (as far as I know, since the scan stores things like the ranges of the values).

My inclination would be to allow any form of sharding depending on the use case of whoever generated the data. I can see how some might favor splitting based on the parent sample, some might favor peptide-sequence-based splitting, and in the worst-case scenario I don't think it would be worse, performance-wise, than a single un-partitioned file.

Idea for explicit sharding ...

{"absolute_files": [
              "PXD004683-958e8400-e29b-41f4-a716-446655440000.pep01.absolute.tsv",
              "PXD004683-958e8400-e29b-41f4-a716-446655440000.pep02.absolute.tsv"
              # OR ... "PXD004683-958e8400-e29b-41f4-a716-446655440000.pep_ak.absolute.tsv"
              # + ... "PXD004683-958e8400-e29b-41f4-a716-446655440000.pep_kw.absolute.tsv"

          ]
     }

(the parquet format is much more interesting than I thought, lots to learn here)
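One way the range-style names above (`pep_ak`, `pep_kw`) could work on the reader side is a tiny routing function. This is purely hypothetical: the suffix convention (two letters encoding an alphabetical range of peptide first letters) is an assumption, not anything defined by the format.

```python
import re

def shard_for_peptide(peptide, shard_paths):
    """Hypothetical router: pick the shard whose '.pep_xy.' suffix encodes an
    alphabetical range x..y covering the peptide's first letter."""
    for path in shard_paths:
        m = re.search(r"\.pep_([a-z])([a-z])\.", path)
        if m and m.group(1) <= peptide[0].lower() <= m.group(2):
            return path  # only this shard needs to be opened
    return None

shards = ["PXD.pep_ak.absolute.parquet", "PXD.pep_lz.absolute.parquet"]
print(shard_for_peptide("ALKPEPTIDE", shards))
```

A convention like this would let a server-less reader open a single shard per lookup without any extra metadata in the json.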

ypriverol commented 2 weeks ago

Solved in #56