BD2KGenomics / ga4gh-integration-deprecated

Tracking for ga4gh-integration projects

GA4GH RNA Recompute Server #36

Open david4096 opened 8 years ago

david4096 commented 8 years ago

Stand up a server with the RNA recompute data.

s3://cgl-rnaseq-recompute-fixed

Assemble metadata preparation for TCGA and GTEx.

rcurrie commented 7 years ago

I'm working on ingesting the latest Treehouse reference, which includes the recompute data as a subset. We can leverage that towards this issue next week.

david4096 commented 7 years ago

Current issues: @rcurrie @saupchurch

Nice to haves:

"description": "SAMN03878301THE BROAD INSTITUTESRX1125092TRANSCRIPTOMIC2015-09-11111582015-10-06SRR2135383SRS1017226GTEX-U8XE-2026-SM-5CHQF_rep1RNA:Total RNAPancreas1664110706569PancreasNoRNA_Seq (NGS)malePRJNA75899Homo sapiensILLUMINASRP012682phs000424GTExGTEX-U8XE",

david4096 commented 7 years ago

To clarify, we might improve the appearance of the description field by adding space separators. So:

SAMN03878301THE BROAD INSTITUTESRX1125092TRANSCRIPTOMIC2015-09-11111582015-10-06SRR2135383SRS1017226GTEX-U8XE-2026-SM-5CHQF_rep1RNA:Total RNAPancreas1664110706569PancreasNoRNA_Seq (NGS)malePRJNA75899Homo sapiensILLUMINASRP012682phs000424GTExGTEX-U8XE

Becomes (guessing based on appearance):

SAMN03878301 THE BROAD INSTITUTE SRX1125092 TRANSCRIPTOMIC 2015-09-11 11158 2015-10-06 SRR2135383 SRS1017226 GTEX-U8XE-2026-SM-5CHQF_rep1 RNA:Total RNA Pancreas 1664110706569 Pancreas No RNA_Seq (NGS) male PRJNA75899 Homo sapiens ILLUMINA SRP012682 phs000424 GTEx GTEX-U8XE

If this takes more than adding a couple of separators to an ETL script, it probably isn't worth it.
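
A minimal sketch of what the separator change might look like, assuming the ETL script builds the description by concatenating the column values of one sample row. The `build_description` helper and the example row are hypothetical; the real script may assemble the field differently.

```python
def build_description(values, sep=" "):
    """Join non-empty metadata values with a separator instead of concatenating them."""
    # Convert to str first so numeric columns are handled, and drop None values.
    parts = [str(v).strip() for v in values if v is not None]
    # Skip values that are empty after stripping so we never emit double separators.
    return sep.join(p for p in parts if p)


# Hypothetical row fragment, using values visible in the description above.
row = ["SAMN03878301", "THE BROAD INSTITUTE", "SRX1125092", "TRANSCRIPTOMIC", "2015-09-11"]
description = build_description(row)
print(description)
```

Joining on the original column values avoids having to guess field boundaries from the concatenated string afterwards.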

david4096 commented 7 years ago

To further clarify, the individualId on biosample messages appears to be malformed:

"individualId": "['f2a63c66-301e-4622-9a3f-43c5cfee79f3']",

It should be a base64 ID of an individual which resolves in the server.

david4096 commented 7 years ago

Looks like the individualId is set on biosamples and the biosampleId is set on RNA quants for Target data. Thanks @ejacox

david4096 commented 7 years ago

Long-winded description of a simple data curation problem

tl;dr: remove the sample-specific keys from the individual metadata info messages so they represent the individual rather than the sample.

Just took a look at the curated GTEx data and noticed something a bit odd.

I grabbed an RNA quant from the GTEx RSEM genes quantification set (link).

I then followed the biosampleId to get this biosample. This biosample's name matches what is on the RNA quantification. In the info message, the BioSample_s tag has the value SAMN02793620 and the bioproject is PRJNA75899. The tissue is listed under body_site_s as Adipose - Visceral (Omentum).

I then grabbed the individual by ID from that biosample here and noticed that the individual message also carries a body_site_s value, and it doesn't match my Adipose sample: for the individual it says Pancreas.

Performing a SearchBiosamplesRequest using the individual ID revealed a number of samples from this individual. I believe we generate the individual messages by parsing the biosample table, so the first biosample for an individual probably becomes the unique individual record.

If we want to keep that pattern, we ought to remove the keys that are specific to a sample from the individual message.
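
A rough sketch of that cleanup, assuming individuals are built by taking the first biosample row seen for each individual. `build_individuals`, the `individual_id` column name, the `sex_s` key, and the example IDs are all hypothetical; only `body_site_s` and `BioSample_s` come from the messages described above, and the real list of sample-specific keys is likely longer.

```python
# Assumed set of keys that describe a sample rather than an individual.
SAMPLE_SPECIFIC_KEYS = {"body_site_s", "BioSample_s"}


def build_individuals(biosample_rows, id_key="individual_id"):
    """Build one info dict per individual from biosample rows, dropping sample-level keys."""
    individuals = {}
    for row in biosample_rows:
        individual_id = row[id_key]
        if individual_id in individuals:
            continue  # keep the existing "first biosample wins" pattern
        individuals[individual_id] = {
            key: value
            for key, value in row.items()
            if key not in SAMPLE_SPECIFIC_KEYS and key != id_key
        }
    return individuals


# Made-up rows: two samples from one individual with different tissues.
rows = [
    {"individual_id": "IND-1", "BioSample_s": "SAMPLE-A", "body_site_s": "Pancreas", "sex_s": "male"},
    {"individual_id": "IND-1", "BioSample_s": "SAMPLE-B", "body_site_s": "Adipose - Visceral (Omentum)", "sex_s": "male"},
]
print(build_individuals(rows))
```

With the sample-level keys filtered out, it no longer matters which biosample row happens to come first for an individual, as long as the remaining keys really are constant per individual.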