DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Enable DUOS dataset descriptions in `anvilprod` #6119

Closed by bvizzier-ucsc 1 month ago

bvizzier-ucsc commented 2 months ago

In the validation of anvil5, populated dataset descriptions are not being displayed. This document contains two examples.

hannes-ucsc commented 2 months ago

Could you please post the examples verbatim here? I would do it but I don't have access to the doc.

It would also be good to ask the Broad team to check this on their end first. If this were a bug in Azul, we would show no description for any of the datasets. The fact that only some datasets are missing the description suggests that this is a bug on their end.

bvizzier-ucsc commented 2 months ago

Here is the content of the document:

Examples of datasets in Data Explorer that we expect to have descriptions but we aren’t seeing it:

ANVIL_GTEx_BCM_GRU_CoRSIVs Source Snapshot Link Data Explorer Page Link DUOS ID: DUOS-000158 Description in DUOS: "description": "Methylation of cytosines in CpG dinucleotides is an epigenetic mechanism with essential roles in mammalian development. To explore its functions in cellular differentiation, unbiased analysis of CpG methylation by whole genome bisulfite sequencing (WGBS) has been used to characterize epigenetic differences among different human tissues and cell types. Meanwhile, human interindividual variation in DNA methylation that is not cell-type specific has attracted relatively little attention. Systemic interindividual epigenetic variation is important, however, because like genetic variation it is a potential determinant of phenotypic variation and can be assessed in any easily obtainable DNA sample. Since systemic epigenetic variants originate in the preimplantation embryo, their establishment can be influenced by periconceptional environment, and potentially provide information about lifetime risks relevant to global health, obesity, and cancer. We elucidated systemic interindividual variation in CpG methylation in the human genome. We studied brain, heart, and thyroid tissues (representing all three germ layer lineages) from each of 10 donors in the NIH Gene-Tissue Expression (GTEx) project. We performed deep whole genome bisulfite sequencing for these 30 samples, achieving a sequencing depth of over 50x coverage per sample. We identified 9,926 correlated regions of systemic interindividual variation (CoRSIVs). These regions, comprising just 0.1% of the human genome, often correlate with one another over long genomic distances, are associated with transposable elements and subtelomeric regions, conserved across various human ethnic groups, and particularly sensitive to periconceptional environment. 
While genetic variation appears to influence methylation at most CoRSIVs, many show no evidence of genetic influence, suggestive of 'pure' epigenetic variation. At CoRSIVs, interindividual variation in DNA methylation in an easily biopsied tissue predicts expression in other tissues, and genes associated with these loci are implicated in a range of human disorders. In addition to charting a previously unrecognized molecular level of human individuality, this atlas of human CoRSIVs provides a resource for future population-based investigations into how interindividual epigenetic variation modulates risk of disease. Platform: AnVIL"

ANVIL_NIA_CARD_Coriell_Cell_Lines_Open Source Snapshot Link Data Explorer Page Link DUOS ID: DUOS-000243 Description in DUOS: "description": "Here we report and share Oxford Nanopore Technologies (ONT) data from three widely used lymphobastoid cell lines (LCLs) (HG002, HG02723 and HG00733) obtained from Coriell (https://www.coriell.org/). This data was generated during optimizing the ONT sequencing protocol for the long-read sequencing efforts of the NIH Intramural Center for Alzheimer's and Related Dementias (CARD, https://card.nih.gov/). The overall aim is to generate high quality long-read data with N50s of ~30kb resulting in >30X coverage. For more information see https://anvil.terra.bio/#workspaces/anvil-datastorage/ANVIL_NIA_CARD_Coriell_Cell_Lines_Open.\nPlatform: AnVIL"

hannes-ucsc commented 2 months ago

Turns out we never enabled this feature in anvilprod. We only enabled it in anvildev because the initial round of dataset descriptions was only provided on the development instances of TDR/DUOS, and I believe we weren't informed about the availability of dataset descriptions in the production instances. In other words, none of the datasets on anvilprod have a description from DUOS.

We'd need the URL of the DUOS production instance, i.e. the production equivalent of the development URL which is https://consent.dsde-dev.broadinstitute.org, and then we can enable it and kick off a reindex. Additionally, the members of the azul-anvil-prod Terra group would need to be given permission to hit the DUOS API.

hannes-ucsc commented 2 months ago

Assignee to coordinate with Broad.

hannes-ucsc commented 2 months ago

Broad is deciding whether to grant the required permissions or whether to drop the need for authorization on the DUOS endpoint(s) that we use.

https://ucsc-gi.slack.com/archives/C03TPJS54DC/p1712270266012199

hannes-ucsc commented 2 months ago

Pinged Broad just now

hannes-ucsc commented 2 months ago

Spike to verify that the DUOS code change doesn't break Azul.

dsotirho-ucsc commented 2 months ago

Dataset descriptions are still present after a reindex on anvildev.

$ curl 'https://service.anvil.gi.ucsc.edu/index/datasets?catalog=anvil&size=100' | jq '.hits[].datasets[] | {"title": .title, "description": .description}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22869  100 22869    0     0  29089      0 --:--:-- --:--:-- --:--:-- 29058
{
  "title": "ANVIL_1000G_2019_Dev",
  "description": null
}
{
  "title": "ANVIL_CCDG_Sample_1",
  "description": "GTEx test study description content"
}
{
  "title": "ANVIL_CMG_Sample_1",
  "description": "This is an example of a description for GREGoR, for UCSC testing purposes "
}

dsotirho-ucsc commented 2 months ago

@hannes-ucsc: "Now waiting for change to land in DUOS production."

hannes-ucsc commented 2 months ago

We were informed that it is in production.

https://ucsc-gi.slack.com/archives/C03TPJS54DC/p1712771474948909?thread_ts=1712270266.012199&cid=C03TPJS54DC

hannes-ucsc commented 2 months ago

Assignee to enable the indexing of dataset descriptions from DUOS. The production instance of DUOS is https://consent.dsde-prod.broadinstitute.org/

hannes-ucsc commented 2 months ago

For demo, show dataset descriptions in anvilprod.
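
A check along the following lines might help when preparing the demo. It is only a sketch: it assumes the `/index/datasets` response has the shape implied by the `jq` filter used earlier in this thread (`.hits[].datasets[]` with `title` and `description`). The helper names `split_by_description` and `fetch_datasets` are hypothetical and not part of Azul, and pagination beyond one page is not handled. The sample data mirrors the anvildev output posted above.

```python
import json
from urllib.request import urlopen


def split_by_description(response):
    """Return (titles with a description, titles without one).

    Hypothetical helper; assumes the response shape implied by the
    jq filter `.hits[].datasets[]` used earlier in this thread.
    """
    described, missing = [], []
    for hit in response.get("hits", []):
        for dataset in hit.get("datasets", []):
            description = dataset.get("description")
            # Treat null and whitespace-only descriptions as missing.
            if description and description.strip():
                described.append(dataset["title"])
            else:
                missing.append(dataset["title"])
    return described, missing


def fetch_datasets(service_url, catalog, size=100):
    """Fetch a single page of /index/datasets (no pagination).

    Hypothetical helper; service_url and catalog are supplied by the caller.
    """
    url = f"{service_url}/index/datasets?catalog={catalog}&size={size}"
    with urlopen(url) as response:
        return json.load(response)


# Sample shaped like the anvildev output posted earlier in this thread.
sample = {
    "hits": [
        {"datasets": [{"title": "ANVIL_1000G_2019_Dev", "description": None}]},
        {"datasets": [{"title": "ANVIL_CCDG_Sample_1",
                       "description": "GTEx test study description content"}]},
        {"datasets": [{"title": "ANVIL_CMG_Sample_1",
                       "description": "This is an example of a description "
                                      "for GREGoR, for UCSC testing purposes "}]},
    ]
}

described, missing = split_by_description(sample)
print("with description:", described)
print("without description:", missing)
```

To run this against a live deployment, pass the service URL and catalog name to `fetch_datasets` and feed the result to `split_by_description`. The values for anvilprod are not stated in this thread, so they are left as parameters here.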