broadinstitute / single_cell_portal_core

Rails/Docker application for the Broad Institute's single cell RNA-seq data portal
https://singlecell.broadinstitute.org
BSD 3-Clause "New" or "Revised" License
62 stars 26 forks source link

Instant header validation for local AnnData files (SCP-5718) #2112

Closed eweitz closed 2 months ago

eweitz commented 2 months ago

This helps users avoid waiting 15-45 minutes to upload a big H5AD file, only to learn of a user error at the end.

Impact

Previously, authors would learn of any validation errors in their AnnData file only after uploading it and then checking their email. That's quite slow because .h5ad files in single cell science tend to be large: 2 GB, 6 GB, sometimes 60 GB. Needing to upload such big files to detect validation problems is frustrating.

Now, some problems are found before upload. Basic AnnData validation is done right in the web upload UI in less than a second. It validates AnnData metadata headers with a few existing rules, and reports any problems instantly and in context. The user journey is smoother, and thus faster.

We made AnnData upload generally available in July 2023, and have noticed a distinct uptick in its usage in the last few months. So improving this UX seems worthwhile.

Demo

Here's how it looks!

https://github.com/user-attachments/assets/7f21c8d4-33f5-4edf-94f6-c43193feb3bb

Technical notes

This uses hdf5-indexed-reader*, an efficient JavaScript library for HDF5 files. JS in the web browser parses the start of the AnnData file, reads obs keys, and applies some basic pre-existing client-side file validation (CSFV) rules. It reuses existing rules via minor rearrangement of parsed AnnData. So we can use the same validation rule code for both classic and AnnData files.

Parsing is streaming and non-blocking. That's important because AnnData files can be larger than client memory. We recognized this as an underlying blocker for AnnData back in November 2022, and started a community discussion about how to solve it. Since then, building on @bmaranville's foundational work, @jrobinso made hdf5-indexed-reader and we now have a library that neatly fits this need.

This changeset also includes a standalone HTML page and spinning DNA graphic. That enables nimbly experimenting with HDF5 and JS using something like Simple Web Server.

* Because hdf5-indexed-reader isn't on NPM, it can't be used as a conventional package. So I went ahead and published an org-scoped version of it, @single-cell-portal/hdf5-indexed-reader. The package.json source trivially adds a scope to the original name. This ensures the canonical version can use a simpler top-level name (presumably hdf5-indexed-reader) if / when it's published to NPM.

Next steps

Test

A new automated test confirms some behavior. To manually test:

  1. yarn install
  2. Ensure Vite frontend service worker cache is disabled
  3. Create a new study "AnnData CSFV"
  4. Select AnnData upload UX
  5. In AnnData tab, click "Choose file"
  6. In Finder, go to single_cell_portal_core/test/test_data/
  7. Select anndata_test_bad_header_no_species.h5ad
  8. Confirm error message is shown, including "File is missing required obs keys: species"
  9. Click "Delete" to remove errant file, then click "Choose file"
  10. In same directory as before, select anndata_test.h5ad
  11. Confirm no error message is shown, and "Save" button is enabled
  12. Upload anndata_test_bad_header_no_species.h5ad and anndata_test.h5ad to your study's GCP bucket
  13. Click "Use bucket path"
  14. Enter "anndata_test_bad_header_no_species.h5ad" in text input
  15. Click off input to blur / change browser focus
  16. Confirm no error message is shown (remote AnnData CSFV will be added next) and "Save" button is enabled

Further details

For more context, see 2024-08-13 SCP demo video. This satisfies SCP-5718.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 90.76923% with 6 lines in your changes missing coverage. Please review.

Project coverage is 69.83%. Comparing base (809a8ae) to head (c256082). Report is 29 commits behind head on development.

Files Patch % Lines
app/javascript/lib/validation/shared-validation.js 93.10% 2 Missing :warning:
app/javascript/lib/validation/validate-anndata.js 93.10% 2 Missing :warning:
...javascript/lib/validation/validate-file-content.js 60.00% 2 Missing :warning:
Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112/graphs/tree.svg?width=650&height=150&src=pr&token=HMWE5BO2a4&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute)](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute) ```diff @@ Coverage Diff @@ ## development #2112 +/- ## =============================================== + Coverage 69.80% 69.83% +0.03% =============================================== Files 324 325 +1 Lines 27276 27319 +43 Branches 2252 2267 +15 =============================================== + Hits 19039 19079 +40 - Misses 8112 8115 +3 Partials 125 125 ``` | [Files](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute) | Coverage Δ | | |---|---|---| | [app/javascript/components/upload/upload-utils.js](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?src=pr&el=tree&filepath=app%2Fjavascript%2Fcomponents%2Fupload%2Fupload-utils.js&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL2phdmFzY3JpcHQvY29tcG9uZW50cy91cGxvYWQvdXBsb2FkLXV0aWxzLmpz) | `91.15% <100.00%> (ø)` | | | [...script/components/validation/ValidationMessage.jsx](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?src=pr&el=tree&filepath=app%2Fjavascript%2Fcomponents%2Fvalidation%2FValidationMessage.jsx&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL2phdmFzY3JpcHQvY29tcG9uZW50cy92YWxpZGF0aW9uL1ZhbGlkYXRpb25NZXNzYWdlLmpzeA==) | `96.29% <ø> (ø)` | | | [app/javascript/lib/validation/validate-file.js](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?src=pr&el=tree&filepath=app%2Fjavascript%2Flib%2Fvalidation%2Fvalidate-file.js&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL2phdmFzY3JpcHQvbGliL3ZhbGlkYXRpb24vdmFsaWRhdGUtZmlsZS5qcw==) | `67.07% <100.00%> (ø)` | | | [app/javascript/lib/validation/shared-validation.js](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?src=pr&el=tree&filepath=app%2Fjavascript%2Flib%2Fvalidation%2Fshared-validation.js&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL2phdmFzY3JpcHQvbGliL3ZhbGlkYXRpb24vc2hhcmVkLXZhbGlkYXRpb24uanM=) | `95.48% <93.10%> (-0.67%)` | :arrow_down: | | [app/javascript/lib/validation/validate-anndata.js](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?src=pr&el=tree&filepath=app%2Fjavascript%2Flib%2Fvalidation%2Fvalidate-anndata.js&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL2phdmFzY3JpcHQvbGliL3ZhbGlkYXRpb24vdmFsaWRhdGUtYW5uZGF0YS5qcw==) | `93.10% <93.10%> (ø)` | | | [...javascript/lib/validation/validate-file-content.js](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112?src=pr&el=tree&filepath=app%2Fjavascript%2Flib%2Fvalidation%2Fvalidate-file-content.js&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL2phdmFzY3JpcHQvbGliL3ZhbGlkYXRpb24vdmFsaWRhdGUtZmlsZS1jb250ZW50Lmpz) | `88.06% <60.00%> (-1.49%)` | :arrow_down: | ... and [7 files with indirect coverage changes](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2112/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute)
eweitz commented 2 months ago

Thanks for the catch. Adaptations I made for the unit test broke the UX -- a downside of discrepant web and Node APIs.

I've fixed the UX and ensured the unit test still works, and added another unit test to help ensure soundness, and refined the manual test steps. The 2 unit tests pass, and the 3 cases outlined in the manual test steps also work for me now.

eweitz commented 2 months ago

Thanks for the reviews!

When I run yarn ui-test

All the CI test runs pass, and my local tests via yarn test pass (modulo two pre-existing local-only unrelated failures). So I think your hunch is right this is an issue on your end. Ping me if you'd like to pair on it!

When CSFV errors are encountered, the CSFV error reports the most recently chosen file but the form itself retains the previous filename

Useful find! It turns out this is reproducible already on staging, and affects that UX flow for all file types. So this is a pre-existing issue. I agree it's confusing when it occurs. Its incidence is likely very low, as selecting a valid file then an invalid file via "Replace" seems rare. Fixing this while we're refining the upload UI might be opportune, though. I've opened SCP-5771 so we can assess priority next sprint.