chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License
37 stars 23 forks source link

cellxgene-schema _is_raw must validate all raw values #382

Closed brianraymor closed 1 year ago

brianraymor commented 1 year ago

Context

Reported by @bkmartinjr. See #single-cell-data-wrangling.

The original _is_raw implementation validated only 5000 values as a default heuristic:

:param int max_values_to_check: total values to check, default set to 5000 due to performance concerns.

This default must be removed. All values must be valid.

nayib-jose-gloria commented 1 year ago

As part of the output of this change, please also provide a few performance benchmarks to determine the performance hit of this change (and thus, if any changes need to be made to the batch migration or dataset upload job timeouts to accommodate, for instance)

jychien commented 1 year ago

We’ve recently had 4 very large datasets in a private collection, and their size was comparable to the large SEA-AD datasets (~45Gb per h5ad file). Wanted to give you a heads up that they’re there and was also curious about how long the validation in the portal will take after this ticket fix goes in.

Bento007 commented 1 year ago

@nayib-jose-gloria do we still want this to be configurable somewhere or completely removed?

nayib-jose-gloria commented 1 year ago

@Bento007 Sounds like completely removed.

Also, looks like Bruce had a suggested implementation in the slack thread linked in the description. However, he also said it requires reading into memory to be performant

jahilton commented 1 year ago

This one passes our QA (our notebook)