Open david4096 opened 8 years ago
I'm working on ingesting the latest treehouse reference which includes the recompute data as a subset. We can leverage that towards this issue next week.
Current issues: @rcurrie @saupchurch
Nice to haves:
description values in the biosample and individual messages:
"description": "SAMN03878301THE BROAD INSTITUTESRX1125092TRANSCRIPTOMIC2015-09-11111582015-10-06SRR2135383SRS1017226GTEX-U8XE-2026-SM-5CHQF_rep1RNA:Total RNAPancreas1664110706569PancreasNoRNA_Seq (NGS)malePRJNA75899Homo sapiensILLUMINASRP012682phs000424GTExGTEX-U8XE",
To clarify, we might improve the appearance of the description field by adding space separators. So:
SAMN03878301THE BROAD INSTITUTESRX1125092TRANSCRIPTOMIC2015-09-11111582015-10-06SRR2135383SRS1017226GTEX-U8XE-2026-SM-5CHQF_rep1RNA:Total RNAPancreas1664110706569PancreasNoRNA_Seq (NGS)malePRJNA75899Homo sapiensILLUMINASRP012682phs000424GTExGTEX-U8XE
Becomes (guessing based on appearance):
SAMN03878301 THE BROAD INSTITUTES RX1125092 TRANSCRIPTOMIC 2015-09-11 111582015-10-06 SRR2135383SRS1017226 GTEX-U8XE-2026-SM-5CHQF _rep1RNA:Total RNA Pancreas 1664110706569 Pancreas NoRNA_Seq (NGS)malePRJNA75899Homo sapiens ILLUMINA SRP012682 phs000424 GTEx GTEX-U8XE
If this takes more than adding a couple of
to an ETL script, it probably isn't worth it.
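A minimal sketch of the separator idea, assuming the ETL script builds the description by concatenating the column values of one metadata row. The function name and every column name except BioSample_s are hypothetical; the real script may look quite different.

```python
def build_description(row, sep=" "):
    """Join the non-empty values of one metadata row with a separator.

    `row` is assumed to be a dict of column name -> value for one sample.
    """
    return sep.join(
        str(value).strip()
        for value in row.values()
        if value not in (None, "")
    )


# Hypothetical row; only BioSample_s is a column name taken from this thread.
row = {
    "BioSample_s": "SAMN03878301",
    "library_source": "TRANSCRIPTOMIC",
    "load_date": "2015-09-11",
    "empty_column": "",
    "body_site_s": "Pancreas",
}

description = build_description(row)
print(description)
```

If the concatenation already happens in one place, switching from an empty-string join to a space join is likely the whole change.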
To further clarify, the individualId on biosample messages appears to be malformed:
"individualId": "['f2a63c66-301e-4622-9a3f-43c5cfee79f3']",
It should be a base64 ID of an individual which resolves in the server.
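The value above looks like the string form of a one-element Python list, which suggests a list was written where a single ID was expected. A sketch of unwrapping it, assuming that is the cause; unwrap_individual_id is a hypothetical helper, and turning the resulting UUID into the server's base64 ID is left out because it depends on the server's ID scheme.

```python
import ast


def unwrap_individual_id(raw):
    """Return a single ID from a value that may be a stringified one-element list."""
    value = raw
    # A value such as "['abc']" is the repr of a list; parse it back safely.
    if isinstance(raw, str) and raw.startswith("["):
        value = ast.literal_eval(raw)
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError("expected exactly one individual ID, got %d" % len(value))
        value = value[0]
    return value


fixed = unwrap_individual_id("['f2a63c66-301e-4622-9a3f-43c5cfee79f3']")
print(fixed)
```

Fixing this at the point where the ID is written during ingest is probably cleaner than unwrapping it afterwards.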
Looks like the individualId is set on biosamples and the biosampleId is set on RNA quants for Target data. Thanks @ejacox
Long-winded description of a simple data curation problem
tl;dr remove some keys from the individual metadata info messages to be more representative of the individual as opposed to the sample.
Just took a look at the gtex data curated for gtex and noticed something a bit odd.
I grabbed an RNA quant from the GTEX RSEM genes quantification set (link).
I then followed the biosampleID to get this biosample. This biosample's name matches what is on the RNA quantification. The tag BioSample_s has a value of SAMN02793620 and bioproject PRJNA75899 in the info message. The tissue is listed by body_site_s as Adipose - Visceral (Omentum).
I then grabbed the individual by ID from that biosample here and noticed that there was detail about the body_site_s in the individual message that didn't match my Adipose, instead for the individual it says Pancreas.
Performing a SearchBiosamplesRequest using the individual ID revealed there are a number of samples from this individual. In order to generate the individual messages we are parsing the biosample table, I believe, and the first biosample for an individual probably becomes the unique individual record.
If we want to keep that pattern, we ought to remove the keys that are specific to a sample from the individual message.
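One way to decide which keys to drop, sketched under the assumption that each biosample record carries an individualId and an info dict: keep only the keys whose value is identical across all of an individual's biosamples. The record layout, the individual ID, and the sex_s key are hypothetical; body_site_s is the key discussed above.

```python
from collections import defaultdict


def individual_level_info(biosamples, individual_key="individualId"):
    """Keep only info keys whose value is the same in every biosample of an individual."""
    grouped = defaultdict(list)
    for sample in biosamples:
        grouped[sample[individual_key]].append(sample["info"])

    result = {}
    for individual_id, infos in grouped.items():
        first = infos[0]
        result[individual_id] = {
            key: value
            for key, value in first.items()
            # A key that differs (or is missing) in any sample is sample-specific.
            if all(key in info and info[key] == value for info in infos)
        }
    return result


# Hypothetical records mirroring the mismatch described above.
biosamples = [
    {"individualId": "individual-1",
     "info": {"body_site_s": "Pancreas", "sex_s": "male"}},
    {"individualId": "individual-1",
     "info": {"body_site_s": "Adipose - Visceral (Omentum)", "sex_s": "male"}},
]

individual_info = individual_level_info(biosamples)
print(individual_info)
```

An individual with only one biosample would keep every key under this rule, so an explicit list of sample-specific keys to remove may still be needed.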
Stand up a server with the RNA recompute data.
s3://cgl-rnaseq-recompute-fixed
Assemble metadata preparation for TCGA and GTEX