Closed hannes-ucsc closed 2 years ago
Step one is to settle on the subset of projects, or more specifically staging areas, that we should use to populate the dataset.
Sounds good to me. If you want to just choose one 10X project that has analysis results and project matrices, that would work. And then whatever we chose previously for a small SS2 dataset, and ideally at least one larger one so that we can test scale a bit more.
Could you link or name the specific projects?
This would be a good 10X project: 559bb888-7829-41f2-ace5-2c05c7eb81e9
This one for SS2 is already in dev and it's the one we've recently produced analysis results for: 8c3c290d-dfff-4553-8868-54ce45f4ba7f
We may want to keep the existing analysis results in dev for that SS2 project, though we may end up replacing them with our second iteration that should fix the project matrices that didn't index properly. Lmk what you think.
We would also like 5b5f05b7-2482-468d-b76d-8f68c04a7a47 (Substantia_nigra_and_locus_coeruleus) in order to validate our solution to https://github.com/DataBiosphere/azul/issues/3095.
And https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6 since it's the only project with organically described CGMs.
A consolidated list with all the projects mentioned above:
From prod:
From dev:
The challenge is now to translate this to a list of staging areas. If that list ends up including a DSS adapter's staging area we will need to decide if we want to import that as is, thereby increasing the size of the new dataset, or if we want to create a stripped down copy of the staging area that only includes the mentioned projects. Another challenge is to match the projects between primary staging area and CGM staging areas.
Here are the staging areas:
* From prod:
  * @kbergin (10x): https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
    * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/559bb888-7829-41f2-ace5-2c05c7eb81e9
    * CGMs
      * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/liver_immune_prohect_annotations.txt
      * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/normalised_expression_matrix.h5
What's the plan on only importing the relevant CGM data and metadata files? Will you create a stripped down SA for these?
We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.
  * @hannes-ucsc (organic CGMs): https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
    * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8
    * CGMs
      * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/annotation_200112.csv
Since these are organic CGMs that are in the same SA as the rest of the project, nothing special needs to be done here.
      * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/covid_portal.h5ad
  * @jessebrennan (large analysis subgraph): https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6
We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.
    * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/5b5f05b7-2482-468d-b76d-8f68c04a7a47
    * No CGMs
* From dev:
  * @kbergin (SS2): https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f
We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.
    * gs://broad-dsp-monster-hca-prod-ucsc-storage/prod/no-analysis/metadata/project
  * @hannes-ucsc ("Lattice"): https://dev.singlecell.gi.ucsc.edu/explore/projects/f0f89c14-7460-4bab-9d42-22228a91f185
    * gs://broad-dsp-monster-hca-dev-lattice/staging/f0f89c14-7460-4bab-9d42-22228a91f185
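Most of the staging areas listed above carry the project UUID in their path, which might help when matching them back to projects. A minimal sketch, assuming the `gs://<bucket>/<prefix>` shape seen in the list; `parse_staging_area` and `StagingArea` are hypothetical names, not part of any existing tooling:

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urlparse

# Lowercase UUID as it appears in the staging area paths above.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


class StagingArea(NamedTuple):
    bucket: str
    prefix: str
    project_id: Optional[str]  # None when the path carries no UUID


def parse_staging_area(url: str) -> StagingArea:
    """Split a gs:// URL into bucket, prefix and (if present) project UUID."""
    parsed = urlparse(url)
    if parsed.scheme != "gs":
        raise ValueError(f"not a gs:// URL: {url}")
    prefix = parsed.path.strip("/")
    match = UUID_RE.search(prefix)
    return StagingArea(parsed.netloc, prefix, match.group(0) if match else None)


lattice = parse_staging_area(
    "gs://broad-dsp-monster-hca-dev-lattice/staging/"
    "f0f89c14-7460-4bab-9d42-22228a91f185"
)
print(lattice.bucket, lattice.project_id)
```

Staging areas without a UUID in the path, such as the `no-analysis` one, come back with `project_id` set to `None` and would still need to be matched by hand.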
@hannes-ucsc re: your first comment, yes, we will be creating a stripped down SA. Re: the analysis files, good catch, I will attempt to dig them up.
Cool. Let me know if you'd like us to help with any of that.
Thanks @hannes-ucsc .
For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad-hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.
Re: the Lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single /data directory with no associated /metadata, /links or /descriptors directory. My hunch is that those sub-folders got removed at some point between our dev testing and now. Is there another potential project that would fit the bill? If not, we can likely reconstruct it out of BigQuery or potentially ask Lattice to re-stage, but that will take a bit of work as well.
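A check like the one described above could be automated. This is a minimal sketch, assuming a complete staging area has the four sub-directories named in the comment and that an object listing is already available as a list of names; `missing_subdirs` and the file name `example.h5ad` are hypothetical:

```python
from typing import Iterable, List

# Sub-directories named in the comment above; treated here as an assumption
# about what a complete staging area contains.
EXPECTED_SUBDIRS = ("data", "metadata", "links", "descriptors")


def missing_subdirs(object_names: Iterable[str], prefix: str) -> List[str]:
    """Return the expected sub-directories that have no objects under prefix."""
    prefix = prefix.strip("/")
    seen = set()
    for name in object_names:
        name = name.lstrip("/")
        if prefix:
            if not name.startswith(prefix + "/"):
                continue  # object lies outside this staging area
            name = name[len(prefix) + 1:]
        # The first path component below the prefix is the sub-directory.
        head, sep, _rest = name.partition("/")
        if sep:
            seen.add(head)
    return [d for d in EXPECTED_SUBDIRS if d not in seen]


# Hypothetical listing resembling the state described above.
listing = [
    "staging/f0f89c14-7460-4bab-9d42-22228a91f185/data/example.h5ad",
]
print(missing_subdirs(listing, "staging/f0f89c14-7460-4bab-9d42-22228a91f185"))
# prints ['metadata', 'links', 'descriptors']
```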
We usually use a BQ query like this one:
-- List the names of analysis files produced by any process in the project.
select
    json_extract(analysis_file.content, '$.file_core.file_name') as file_name
from `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.links` as links
-- One row per link in each links document; keep only process links.
join unnest(json_extract_array(links.content, '$.links')) as content_links
    on json_extract_scalar(content_links, '$.link_type') = 'process_link'
-- One row per process output; keep only analysis files.
join unnest(json_extract_array(content_links, '$.outputs')) as outputs
    on json_extract_scalar(outputs, '$.output_type') = 'analysis_file'
join `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.analysis_file` as analysis_file
    on json_extract_scalar(outputs, '$.output_id') = analysis_file.analysis_file_id
where links.project_id = '8c3c290d-dfff-4553-8868-54ce45f4ba7f'
limit 100
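For anyone working from the links files directly rather than from BigQuery, the same traversal can be sketched in Python. This is only an illustration of the query's logic over a single links document; `analysis_file_ids` is a hypothetical helper, and the example document and its IDs are placeholders:

```python
import json
from typing import Iterator


def analysis_file_ids(links_content: str) -> Iterator[str]:
    """Yield output_id for every analysis_file output of a process_link."""
    doc = json.loads(links_content)
    for link in doc.get("links", []):
        # Mirrors the first join condition of the query.
        if link.get("link_type") != "process_link":
            continue
        for output in link.get("outputs", []):
            # Mirrors the second join condition of the query.
            if output.get("output_type") == "analysis_file":
                yield output["output_id"]


# Hypothetical, heavily trimmed links document; the IDs are placeholders.
example = json.dumps({
    "links": [
        {
            "link_type": "process_link",
            "outputs": [
                {"output_type": "analysis_file", "output_id": "file-1"},
                {"output_type": "sequence_file", "output_id": "file-2"},
            ],
        },
        {"link_type": "supplementary_file_link", "files": []},
    ]
})
print(list(analysis_file_ids(example)))  # prints ['file-1']
```

The yielded IDs correspond to `analysis_file_id` in the query; resolving them to file names would still require the analysis file metadata.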
@theathorn can we ask the Stanford folks to repopulate the staging area? I have the feeling this is a temporary condition.
These are the related analysis files for the other 3 projects:
* https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
  * Analysis files:
    * gs://fc-ece18604-add9-45db-9bc2-78c77e471f71/staging/data/liver-immune-cells-human-blood-10XV2.loom
    * gs://fc-58886c47-be1c-41f7-8d8b-df18aef417cc/staging/data/liver-immune-cells-human-liver-10XV2.loom
* https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
  * Analysis files:
    * None
* https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6
  * Analysis files:
    * gs://fc-239060e3-44ef-4bfb-93f5-b388950be17e/staging/data/substantia-negra-human-brain-10XV2-nuclei.loom
There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities.
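To illustrate what treating a subgraph as the unit of work could look like, here is a rough sketch that collects every entity a links document references, not just the analysis files. The field names `process_type`, `process_id`, `input_type`, `input_id`, `protocol_type` and `protocol_id` are assumptions about the links schema (only `link_type`, `outputs`, `output_type` and `output_id` appear in this thread), and `subgraph_entities` is a hypothetical helper:

```python
import json
from typing import Dict, Set


def subgraph_entities(links_content: str) -> Dict[str, Set[str]]:
    """Group the IDs of all entities referenced by process links by type."""
    entities: Dict[str, Set[str]] = {}

    def add(entity_type: str, entity_id: str) -> None:
        entities.setdefault(entity_type, set()).add(entity_id)

    for link in json.loads(links_content).get("links", []):
        if link.get("link_type") != "process_link":
            continue  # other link types are out of scope for this sketch
        # Assumed field names, see the note above.
        add(link["process_type"], link["process_id"])
        for item in link.get("inputs", []):
            add(item["input_type"], item["input_id"])
        for item in link.get("outputs", []):
            add(item["output_type"], item["output_id"])
        for item in link.get("protocols", []):
            add(item["protocol_type"], item["protocol_id"])
    return entities


# Hypothetical links document; all IDs are placeholders.
example = json.dumps({
    "links": [{
        "link_type": "process_link",
        "process_type": "analysis_process",
        "process_id": "proc-1",
        "inputs": [{"input_type": "sequence_file", "input_id": "seq-1"}],
        "outputs": [{"output_type": "analysis_file", "output_id": "file-1"}],
        "protocols": [
            {"protocol_type": "analysis_protocol", "protocol_id": "prot-1"}
        ],
    }]
})
for entity_type, ids in sorted(subgraph_entities(example).items()):
    print(entity_type, sorted(ids))
```

Copying a subgraph would then mean copying the links document itself plus the metadata, descriptors and data for every ID returned here.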
Let me ask this: Since we're ok with including the dcp1-migrated data (which significantly increases the size of the dataset, both in # of rows and volume of data) wouldn't it be easier to just copy prod to dev so to speak?
@hannes-ucsc , thanks.
Ah, thanks, we'll take that into account and update the list.
We do not have that ability. However, we will need to build such a thing for the TDR prod migration that we'll be speaking about.
The team discussed this today after the DCP demo. Consensus is that Monster will implement tooling to selectively copy the (meta)data of individual projects between two TDR instances. This would eliminate the need to retain staging areas after they were imported so that they might be imported again, a need we never actually specified. The ability to copy only selected projects would address the concern of dev being a costly 100% copy of prod.
Assignee to modify the configuration of the dev deployment to list the snapshots from the above Google sheet.
For demo, show diversity of sources in service responses.
Rather than patching https://github.com/DataBiosphere/azul/issues/2873 and https://github.com/DataBiosphere/azul/issues/2870, we think that it's time to repopulate the dev dataset from scratch with a subset of projects from the dcp4 catalog used in prod. This would 1) reduce the size of the dev catalog and 2) make sure it is more representative of the current production systems. For example, the current dev snapshot does not have any intact analysis subgraphs or DCP/2-generated matrices.