Closed brianraymor closed 1 year ago
As part of the output of this change, please also provide a few performance benchmarks to determine the performance hit of this change (and thus, if any changes need to be made to the batch migration or dataset upload job timeouts to accommodate, for instance)
We’ve recently had 4 very large datasets in a private collection, and their size was comparable to the large SEA-AD datasets (~45Gb per h5ad file). Wanted to give you a heads up that they’re there and was also curious about how long the validation in the portal will take after this ticket fix goes in.
@nayib-jose-gloria do we still want this to be configurable somewhere or completely removed?
@Bento007 Sounds like completely removed.
Also, looks like Bruce had a suggested implementation in the slack thread linked in the description. However, he also said it requires reading into memory to be performant
This one passes our QA (our notebook)
Context
Reported by @bkmartinjr. See #single-cell-data-wrangling.
The original
_is_raw
implementation validated only 5000 values as a default heuristic::param int max_values_to_check: total values to check, default set to 5000 due to performance concerns.
This default must be removed. All values must be valid.