d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

update 2 DGD BS_ids in MAF / make sure MAF is gzipped #347

Closed jharenza closed 2 years ago

jharenza commented 2 years ago

What data file(s) does this issue pertain to?

snv-dgd.maf.tsv

What release are you using?

v11

Put your question or report your issue here.

  1. Upon download, the MAF is a plain TSV file, not gzipped.
  2. When adding DGD IDs to the v11 histologies file, we were using an outdated manifest with old BS_ids, which resulted in a loss of 112 BS_ids. This was fixed in 5c455a4 by using the data warehouse as the source of truth for DGD IDs. However, two BS_ids in the current MAF mentioned above ("BS_YWAMZMGF", "BS_NEW113J5") are presumably still the old IDs, as they are not in the data warehouse. Please replace them with the updated IDs as per the genomic file manifest found below:
select * from bix_workflows.dgd_genomics_file_manifest

jharenza commented 2 years ago

@zhangb1 can you take a look at this please

zhangb1 commented 2 years ago

Hmm, I just checked the v11 folder:

s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v11/snv-dgd.maf.tsv.gz

This file is gzipped. Or am I missing something?

jharenza commented 2 years ago

It looks like it has the .gz extension, yes, but when you download it, it saves as a plain TSV. What I meant, though, was that we need to update the two BS_ids above. Can you check on that, please?
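
Since the file name and the downloaded content disagree here, one way to settle it is to look at the first bytes of the local copy instead of the extension. The sketch below relies on the fact that every gzip stream starts with the two bytes 0x1f 0x8b; the function name `is_gzipped` and the idea of running it on the downloaded copy are illustrative assumptions, not something from this thread.

```python
# Gzip streams always begin with these two magic bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes.

    Hypothetical helper: it inspects content only, so a file named
    `*.gz` that actually holds plain text is reported as False.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

Running `is_gzipped` on the local copy of snv-dgd.maf.tsv.gz would show whether the saved file is compressed or was stored as plain TSV under a .gz name.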

zhangb1 commented 2 years ago

^ @HuangXiaoyan0106

HuangXiaoyan0106 commented 2 years ago

@jharenza I checked bix_workflows.dgd_genomics_file_manifest, and these two BS_ids are still in the table. I also didn't see any updated IDs for these two MAF files (ET_6TJ718RG_DGD.vep.maf, ET_2VMXCM6Y_DGD.vep.maf). Or did I run the wrong check?

check_id_results.csv

SELECT * FROM bix_workflows.dgd_genomics_file_manifest WHERE biospecimen_id='BS_YWAMZMGF'
SELECT * FROM bix_workflows.dgd_genomics_file_manifest WHERE biospecimen_id='BS_NEW113J5'

OR

SELECT * FROM bix_workflows.dgd_genomics_file_manifest WHERE file_name='ET_6TJ718RG_DGD.vep.maf'
SELECT * FROM bix_workflows.dgd_genomics_file_manifest WHERE file_name='ET_2VMXCM6Y_DGD.vep.maf'
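
The per-ID queries above can be complemented by a bulk comparison: collect every sample ID in the MAF and subtract the IDs known to the warehouse. This is a minimal sketch under assumptions that the thread does not confirm: that the MAF stores the BS_id in the standard `Tumor_Sample_Barcode` column, that lines starting with `#` are comments, and that the warehouse IDs have already been exported into a Python collection. The helper names are hypothetical.

```python
import gzip


def open_text(path):
    """Open a TSV as text, using gzip only if the content is really gzipped."""
    with open(path, "rb") as probe:
        gzipped = probe.read(2) == b"\x1f\x8b"
    if gzipped:
        return gzip.open(path, "rt", newline="")
    return open(path, "rt", newline="")


def sample_ids_in_maf(path, id_column="Tumor_Sample_Barcode"):
    """Return the set of sample IDs found in `id_column` of a MAF file.

    Assumes the first non-comment line is the tab-separated header.
    """
    ids = set()
    id_index = None
    with open_text(path) as handle:
        for line in handle:
            if id_index is None:
                if line.startswith("#"):
                    continue  # skip comment lines such as "#version 2.4"
                id_index = line.rstrip("\r\n").split("\t").index(id_column)
                continue
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) > id_index:
                ids.add(fields[id_index])
    return ids


def ids_missing_from(maf_ids, known_ids):
    """Return, sorted, the MAF IDs that are absent from `known_ids`."""
    return sorted(set(maf_ids) - set(known_ids))
```

Calling `ids_missing_from(sample_ids_in_maf(path), warehouse_ids)` lists every BS_id that appears in the MAF but not in the exported warehouse IDs, which avoids checking IDs one at a time.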

jharenza commented 2 years ago

Hmm, @nicholasvk can you check why these two are in the file but not in the data warehouse view, please?

nicholasvk commented 2 years ago

There are 2 GENIE records that were not mapped at the time of our workflow development efforts with DGD. They have the old external sample ID format (C ID + a sequential number) rather than the new format, in which we associated DGD clinical assays with diagnoses captured in the DGD REDCap project. These must not have mapped at the time, and we would need to revisit them to see if they can be mapped. Until they are mapped, I think it makes sense to exclude them from the PBTA / OT workflow. They are already excluded from the histologies file, but I'm not sure what the implications of removing them from the MAF file would be. Would GENIE researchers be using this?

jharenza commented 2 years ago

OK, no problem. We can remove them from the MAF. @runjin326 can you do this, please?

I don't know if GENIE users are using this at all, so it's fine to exclude them. Thanks for looking into this, @nicholasvk.

runjin326 commented 2 years ago

@jharenza - updated and uploaded to S3; also updated md5sum.txt.
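
The thread does not show how the update was done. As a rough sketch of one way to do it, the code below copies a gzipped MAF while dropping rows for the two unmapped BS_ids, then computes the MD5 of the result for md5sum.txt. It assumes the BS_id lives in the standard `Tumor_Sample_Barcode` column and that the input really is gzip-compressed; the function names and file paths are hypothetical.

```python
import gzip
import hashlib

# The two BS_ids named in this issue as not yet mapped.
EXCLUDED_IDS = frozenset({"BS_YWAMZMGF", "BS_NEW113J5"})


def drop_samples(src_path, dst_path, excluded=EXCLUDED_IDS,
                 id_column="Tumor_Sample_Barcode"):
    """Copy a gzipped MAF to `dst_path`, skipping rows for excluded samples.

    Comment lines (starting with "#") and the header are copied unchanged.
    Returns the number of rows dropped.
    """
    dropped = 0
    id_index = None
    with gzip.open(src_path, "rt", newline="") as src, \
            gzip.open(dst_path, "wt", newline="") as dst:
        for line in src:
            if id_index is None:
                # Still before the data rows: comments, then the header.
                if not line.startswith("#"):
                    header = line.rstrip("\r\n").split("\t")
                    id_index = header.index(id_column)
                dst.write(line)
                continue
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) > id_index and fields[id_index] in excluded:
                dropped += 1
                continue
            dst.write(line)
    return dropped


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Writing the output through `gzip.open` also keeps the uploaded file compressed, which addresses the first point of the issue, and `md5_of_file` on the new file gives the value to record in md5sum.txt.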

jharenza commented 2 years ago

Thanks @runjin326 !