chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License

Bug: Staging env has broken public dataset download links #7318

Open nayib-jose-gloria opened 2 months ago

nayib-jose-gloria commented 2 months ago

Example of dataset download link returning 403 in staging (but not prod): https://datasets.cellxgene.staging.single-cell.czi.technology/1e25d3e2-e3e7-49c6-a543-f378c15bfb8f.h5ad

Other datasets whose artifacts have this issue in staging: 2104fbb8-8ce3-4740-8b6a-bcbb46a13c0f, ff12e239-9292-4d25-bb0d-e4509b3bd92b
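A minimal sketch for re-checking these links after each step below, assuming every listed ID maps to a download URL of the same form as the example link above (`<base>/<id>.h5ad`). The function names are hypothetical and not part of the portal codebase.

```python
import urllib.error
import urllib.request

# Assumption: base URL and "<id>.h5ad" pattern inferred from the example link.
STAGING_BASE = "https://datasets.cellxgene.staging.single-cell.czi.technology"


def head_status(url, timeout=10.0):
    """Return the HTTP status code of a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is what we report.
        return err.code


def check_download_links(ids, fetch_status=head_status, base=STAGING_BASE):
    """Map each ID to the status code returned for its .h5ad download link."""
    return {i: fetch_status(f"{base}/{i}.h5ad") for i in ids}
```

Calling `check_download_links` with the three IDs from this issue should report 403 for the broken links and 200 once they are fixed. The `fetch_status` parameter exists only so the check can be exercised without network access.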

Early investigation shows that this dataset artifact is missing from the staging S3 bucket that hosts public dataset assets. It's possible that this is the result of an early-terminated mirroring job: we have a script that mirrors the prod DB and assets to staging, which engineers can run locally. It mirrors the DB first, then the assets, so a run that stops partway could leave DB records pointing at assets that were never copied.
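One way to look for other affected datasets is to compare the asset keys the DB expects against what is actually in the bucket. The sketch below assumes a boto3-style S3 client is passed in and that object keys look like `<id>.h5ad` (inferred from the download URL). The bucket name is not stated in this issue, so it is left as a parameter, and both function names are hypothetical. As a side note, S3 generally answers 403 rather than 404 for a missing key when the requester lacks permission to list the bucket, which would be consistent with the 403 seen here.

```python
def list_bucket_keys(s3_client, bucket):
    """Collect every object key in bucket using a boto3-style S3 client."""
    keys = set()
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from a page that holds no objects.
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys


def missing_assets(expected_keys, bucket_keys):
    """Return the expected keys that are absent from the bucket, sorted."""
    return sorted(set(expected_keys) - set(bucket_keys))
```

If `missing_assets` returns more than the three datasets listed above, that would support the interrupted-mirror theory over a per-dataset bug.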

To confirm, we should download the H5AD, strip its labels with the cellxgene-schema CLI, re-upload the H5AD in staging, and check whether the dataset download link now works.

Then, test that editing the dataset title and changing the DOI in the UI (both actions trigger a dataset update) does not cause the asset download link to start failing.

Finally, run the mirroring script (make mirror_env_data DEST_ENV=staging in single-cell-data-portal/backend) from prod to staging and ensure it runs to completion. Then check whether this fixes the asset download links for the datasets listed above.
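A small wrapper can make it explicit whether the mirroring run finished. This is only a sketch: it assumes the make target exits non-zero when the job is interrupted or fails, which has not been verified, and `run_mirror` is a hypothetical helper name.

```python
import subprocess


def run_mirror(backend_dir, dest_env="staging", runner=subprocess.run):
    """Run the mirroring make target and report whether it exited cleanly."""
    command = ["make", "mirror_env_data", f"DEST_ENV={dest_env}"]
    # check=False: inspect the return code ourselves instead of raising.
    result = runner(command, cwd=backend_dir, check=False)
    return result.returncode == 0
```

A `False` result means the download links should not be re-checked yet, since the asset copy may be incomplete.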

If any step above results in a broken download link, create a follow-up issue to investigate the bug and prioritize it as a P0, since it may be affecting prod.