chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Return a checksum value for fetched Census data #1028

Closed hthomas-czi closed 3 months ago

hthomas-czi commented 6 months ago

A checksum would help users confirm that the data they requested was fetched completely.

User Quote

I’d like checksum for data downloaded to check for validity of downloads

hthomas-czi commented 5 months ago

@MaximilianLombardo To follow up with original requester

MaximilianLombardo commented 5 months ago

Additional context from the user:

The problem initially arose from the university's computing system timing out before large datasets could be completely pulled from the census.

The user found a workaround by fetching smaller sets of data (individual donors/samples) which reduced the need for pulling large datasets and was more successful.

In cases where jobs do not finish, there is sometimes a partial file left, which often cannot be read, highlighting the need for a way to verify data completeness.
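To make the completeness check concrete, here is a minimal sketch of verifying a downloaded file against a known digest. It assumes an expected SHA-256 value is available from somewhere; the Census does not return one today, which is what this issue asks for. The helper names `sha256_of_file` and `verify_download` are hypothetical and not part of any Census API.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large H5AD files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True only if the file exists and its digest matches.

    `expected_sha256` is an assumption: a hex digest published alongside
    the data, which is not currently provided.
    """
    p = Path(path)
    if not p.is_file():
        return False
    return sha256_of_file(p) == expected_sha256.strip().lower()
```

A truncated file left behind by a timed-out job would produce a different digest, so `verify_download` would return `False` for it.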

bkmartinjr commented 5 months ago

@MaximilianLombardo - need to know which API was being used. In other words - fetching an H5AD, doing a slice from the SOMA objects, etc.

Some of these are possible to fix, others are quite challenging. It would help to slice the requirement more finely to determine whether this issue is feasible to solve.

MaximilianLombardo commented 5 months ago

@bkmartinjr

need to know which API was being used. In other words - fetching an H5AD, doing a slice from the SOMA objects, etc.

ah ok - like specifically which function was being used?

bkmartinjr commented 5 months ago

ah ok - like specifically which function was being used

Yes, or alternatively a description of the access path/workflow. Doesn't need to be super detailed, but there are different data paths in the system, some of which can be more easily solved than others.

MaximilianLombardo commented 5 months ago

great, followed up with the user for this info

MaximilianLombardo commented 5 months ago

The user responded that they were using download_source_h5ad() at the time.
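Until a checksum is available, one possible client-side mitigation for leftover partial files is to download to a temporary name and rename it only on success. The sketch below is an assumption-laden illustration, not Census behavior: `download_atomically` is a hypothetical helper, and `download` stands for any callable that writes a file to the path it is given.

```python
import os


def download_atomically(download, to_path):
    """Run download(tmp_path), then rename tmp_path to to_path on success.

    `download` is a hypothetical callable that writes a file to the
    path it receives and raises if the transfer fails.
    """
    tmp_path = f"{to_path}.partial"
    # Clear debris left by an earlier interrupted run.
    if os.path.exists(tmp_path):
        os.remove(tmp_path)
    try:
        download(tmp_path)
    except BaseException:
        # Remove the incomplete file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Rename in one step so to_path never holds a partial file.
    os.replace(tmp_path, to_path)
    return to_path
```

Assuming `download_source_h5ad()` accepts a dataset ID and a destination path, it could be wrapped as `download_atomically(lambda p: cellxgene_census.download_source_h5ad(dataset_id, p), "dataset.h5ad")`. If the scheduler kills the job outright, the cleanup code does not run and a `.partial` file may remain, but the final path still never contains an incomplete download. This does not detect corruption in transit; that still needs a published checksum.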

bkmartinjr commented 5 months ago

In which case, it is more or less the same as chanzuckerberg/single-cell-data-portal#4392

My expectation is that if these are added, the source will be the data portal

CC @brianraymor

brianraymor commented 3 months ago

@pablo-gar - can this be closed as a duplicate of chanzuckerberg/single-cell-data-portal#4392 per Bruce's comment? This is in the Q3 backlog. CC: @metakuni

pablo-gar commented 3 months ago

Yes, based on all the info provided by Harley and Max, it is certainly a duplicate of https://github.com/chanzuckerberg/single-cell-data-portal/issues/4392