HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)

TDR `dev` dataset is stale #17

Closed hannes-ucsc closed 2 years ago

hannes-ucsc commented 3 years ago

Rather than patching

* https://github.com/DataBiosphere/azul/issues/2873
* https://github.com/DataBiosphere/azul/issues/2870

we think that it's time to repopulate the dev dataset from scratch with a subset of projects from the dcp4 catalog used in prod. This would 1) reduce the size of the dev catalog and 2) make sure it is more representative of the current production systems. For example, the current dev snapshot does not have any intact analysis subgraphs or DCP/2-generated matrices.

hannes-ucsc commented 3 years ago

Step one is to settle on the subset of projects, more specifically staging areas, that we should use to populate the dataset from.

kbergin commented 3 years ago

Sounds good to me. If you want to just choose one 10X project that has analysis results and project matrices, that would work. And then whatever we chose previously for a small SS2 dataset, and ideally at least one larger one so that we can test scale a bit more.

hannes-ucsc commented 3 years ago

Could you link or name the specific projects?

kbergin commented 3 years ago

This would be a good 10X project: 559bb888-7829-41f2-ace5-2c05c7eb81e9

This one for SS2 is already in dev and it's the one we've recently produced analysis results for: 8c3c290d-dfff-4553-8868-54ce45f4ba7f

We may want to keep the existing analysis results in dev for that SS2 project, though we may end up replacing them with our second iteration that should fix the project matrices that didn't index properly. Lmk what you think.

jessebrennan commented 3 years ago

We would also like 5b5f05b7-2482-468d-b76d-8f68c04a7a47 (Substantia_nigra_and_locus_coeruleus) in order to validate our solution to https://github.com/DataBiosphere/azul/issues/3095.

hannes-ucsc commented 3 years ago

And https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6 since it's the only project with organically described CGMs.

hannes-ucsc commented 3 years ago

A consolidated list with all the projects mentioned above:

hannes-ucsc commented 3 years ago

The challenge is now to translate this to a list of staging areas. If that list ends up including a DSS adapter's staging area we will need to decide if we want to import that as is, thereby increasing the size of the new dataset, or if we want to create a stripped down copy of the staging area that only includes the mentioned projects. Another challenge is to match the projects between primary staging area and CGM staging areas.
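A rough sketch of how the stripped-down copy could pick its links files, in Python. It rests on an assumption that is not stated in this thread: that links objects in a staging area are named `links/{links_id}_{version}_{project_id}.json`. The function name `links_for_projects` and the sample object names are made up for illustration.

```python
def links_for_projects(object_names, project_ids):
    """Select the links objects that belong to the given projects.

    Assumes (not confirmed in this thread) that links objects are named
    links/{links_id}_{version}_{project_id}.json.
    """
    wanted = set(project_ids)
    selected = []
    for name in object_names:
        if not (name.startswith("links/") and name.endswith(".json")):
            continue  # not a links object
        stem = name[len("links/"):-len(".json")]
        # The project ID is assumed to be the last underscore-separated part
        project_id = stem.rsplit("_", 1)[-1]
        if project_id in wanted:
            selected.append(name)
    return selected
```

This only narrows down the links files. The metadata, descriptor and data files referenced by each selected links file would still have to be resolved and copied along with it.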

aherbst-broad commented 3 years ago

Here are the staging areas:

hannes-ucsc commented 3 years ago

Here are the staging areas:

* From prod:

  * @kbergin (10x): https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6

    * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/559bb888-7829-41f2-ace5-2c05c7eb81e9
    * CGMs

      * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/liver_immune_prohect_annotations.txt
      * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/normalised_expression_matrix.h5

What's the plan on only importing the relevant CGM data and metadata files? Will you create a stripped down SA for these?

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

  * @hannes-ucsc (organic CGMs): https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6

    * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8
    * CGMs

      * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/annotation_200112.csv

Since these are organic CGMs that are in the same SA as the rest of the project, nothing special needs to be done here.

      * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/covid_portal.h5ad
  * @jessebrennan (large analysis subgraph): https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

    * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/5b5f05b7-2482-468d-b76d-8f68c04a7a47
    * No CGMs

* From dev:

  * @kbergin (SS2): https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

    * gs://broad-dsp-monster-hca-prod-ucsc-storage/prod/no-analysis/metadata/project
  * @hannes-ucsc ("Lattice"): https://dev.singlecell.gi.ucsc.edu/explore/projects/f0f89c14-7460-4bab-9d42-22228a91f185

    * gs://broad-dsp-monster-hca-dev-lattice/staging/f0f89c14-7460-4bab-9d42-22228a91f185

aherbst-broad commented 3 years ago

@hannes-ucsc re: your first comment, yes, we will be creating a stripped down SA. Re: the analysis files, good catch, I will attempt to dig them up.

hannes-ucsc commented 3 years ago

Cool. Let me know if you'd like us to help with any of that.

aherbst-broad commented 3 years ago

Thanks @hannes-ucsc.

For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad-hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.

Re: the lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single /data with no associated /metadata, /links or /descriptors directory. My hunch is that those sub-folders got removed at some point between our dev testing and now. Is there another potential project that would fit the bill? If not, we can likely reconstruct out of bigquery or potentially ask lattice to re-stage, but that will take a bit of work as well.
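A check like the one described above could be sketched in Python as follows. It takes a plain list of object paths (for example the output of a recursive bucket listing) and reports which of the four directories named above are absent. The function name `missing_dirs` and the sample file name are hypothetical.

```python
EXPECTED_DIRS = ("data", "metadata", "links", "descriptors")

def missing_dirs(staging_area, object_paths):
    """Return the expected top-level directories that hold no objects."""
    prefix = staging_area.rstrip("/") + "/"
    present = set()
    for path in object_paths:
        if not path.startswith(prefix):
            continue  # object lies outside this staging area
        first, sep, _ = path[len(prefix):].partition("/")
        if sep:  # only count objects that sit inside a sub-directory
            present.add(first)
    return [d for d in EXPECTED_DIRS if d not in present]

area = "gs://broad-dsp-monster-hca-dev-lattice/staging/f0f89c14-7460-4bab-9d42-22228a91f185"
# Hypothetical listing with a single data file, as in the situation described
print(missing_dirs(area, [area + "/data/example.h5ad"]))
```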

These are the related analysis files for the other 3 projects:

hannes-ucsc commented 3 years ago

> For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad-hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.

We usually use a BQ query like this one:

```sql
-- List the file names of analysis files produced by process links in one project
select
    json_extract(analysis_file.content, '$.file_core.file_name') as file_name
from `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.links` as links
-- One row per link in each links document, keeping only process links
join unnest(json_extract_array(links.content, '$.links')) as content_links
    on json_extract_scalar(content_links, '$.link_type') = 'process_link'
-- One row per output of each process link, keeping only analysis files
join unnest(json_extract_array(content_links, '$.outputs')) as outputs
    on json_extract_scalar(outputs, '$.output_type') = 'analysis_file'
-- Resolve each output ID to its row in the analysis_file table
join `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.analysis_file` as analysis_file
    on json_extract_scalar(outputs, '$.output_id') = analysis_file.analysis_file_id
where project_id = '8c3c290d-dfff-4553-8868-54ce45f4ba7f'
limit 100
```
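For a single links document that has already been downloaded, the same traversal can be sketched in Python. This is an illustration only: the function name `analysis_file_ids` is made up, and the only field names used are the JSON paths that appear in the query (`links`, `link_type`, `outputs`, `output_type`, `output_id`).

```python
import json

def analysis_file_ids(links_json):
    """Return the output_id of every analysis_file output of a process_link."""
    doc = json.loads(links_json)
    ids = []
    for link in doc.get("links", []):
        if link.get("link_type") != "process_link":
            continue  # mirrors the first join condition in the query
        for output in link.get("outputs", []):
            if output.get("output_type") == "analysis_file":
                ids.append(output["output_id"])
    return ids
```

The IDs it returns correspond to `analysis_file.analysis_file_id` in the query, so they would still need to be looked up to get file names.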

> Re: the lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single /data with no associated /metadata, /links or /descriptors directory. My hunch is that those sub-folders got removed at some point between our dev testing and now. Is there another potential project that would fit the bill? If not, we can likely reconstruct out of bigquery or potentially ask lattice to re-stage, but that will take a bit of work as well.

@theathorn can we ask the Stanford folks to repopulate the staging area? I have the feeling this is a temporary condition.

These are the related analysis files for the other 3 projects:

* https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6

  * Analysis files:
    * gs://fc-ece18604-add9-45db-9bc2-78c77e471f71/staging/data/liver-immune-cells-human-blood-10XV2.loom
    * gs://fc-58886c47-be1c-41f7-8d8b-df18aef417cc/staging/data/liver-immune-cells-human-liver-10XV2.loom

* https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6

  * Analysis files:
    * None

* https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6

  * Analysis files:
    * gs://fc-239060e3-44ef-4bfb-93f5-b388950be17e/staging/data/substantia-negra-human-brain-10XV2-nuclei.loom

There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities.
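To make the subgraph point concrete, here is a minimal Python sketch that collects every entity a links document refers to, grouped by entity type. The `outputs`, `output_type` and `output_id` fields appear in the query earlier in this thread. The `process_id`, `process_type`, `inputs`, `input_type`, `input_id`, `protocols`, `protocol_type` and `protocol_id` fields are my assumption about the links schema and should be checked against it. The function name `subgraph_entities` is hypothetical.

```python
from collections import defaultdict

def subgraph_entities(links_doc):
    """Group the IDs of all entities referenced by process links by type."""
    entities = defaultdict(set)
    for link in links_doc.get("links", []):
        if link.get("link_type") != "process_link":
            continue
        # Assumed field names: process_id / process_type
        if "process_id" in link:
            entities[link.get("process_type", "process")].add(link["process_id"])
        # Assumed field names: inputs[].input_type / input_id
        for inp in link.get("inputs", []):
            entities[inp["input_type"]].add(inp["input_id"])
        # Field names taken from the query in this thread
        for out in link.get("outputs", []):
            entities[out["output_type"]].add(out["output_id"])
        # Assumed field names: protocols[].protocol_type / protocol_id
        for prot in link.get("protocols", []):
            entities[prot["protocol_type"]].add(prot["protocol_id"])
    return dict(entities)
```

Copying everything this returns for a links document, plus the links document itself, is what treating the subgraph as the unit of work would amount to under these assumptions.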

Let me ask this: Since we're ok with including the dcp1-migrated data (which significantly increases the size of the dataset, both in # of rows and volume of data) wouldn't it be easier to just copy prod to dev so to speak?

aherbst-broad commented 3 years ago

@hannes-ucsc, thanks.

> There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities.

Ah, thanks, we'll take that into account and update the list.

> Let me ask this: Since we're ok with including the dcp1-migrated data (which significantly increases the size of the dataset, both in # of rows and volume of data) wouldn't it be easier to just copy prod to dev so to speak?

We do not have that ability. However, we will need to build such a thing for the TDR prod migration that we'll be speaking about.

hannes-ucsc commented 3 years ago

The team discussed this today after the DCP demo. The consensus is that Monster will implement tooling to selectively copy the (meta)data of individual projects between two TDR instances. This would eliminate the need to retain staging areas after they have been imported so that they can be imported again, a need we never actually specified. The ability to copy only selected projects would address the concern of dev being a costly 100% copy of prod.

hannes-ucsc commented 3 years ago

https://humancellatlas.slack.com/archives/C01360XN04S/p1631200873039200

theathorn commented 3 years ago

List of new dev snapshots.

melainalegaspi commented 3 years ago

Assignee to modify the configuration of the dev deployment to list the snapshots from the above Google sheet.

hannes-ucsc commented 2 years ago

For demo, show diversity of sources in service responses.