broadinstitute / gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons

New maf datatypes need to be recognized #68

Open gsaksena opened 6 years ago

gsaksena commented 6 years ago

There should be new datatypes for somaticsniper, muse, and varscan maf files.

The following commands:

gdc_mirror --config tcgaSmoketest.cfg
gdc_dice --config tcgaSmoketest.cfg

yield the following output:

...
2017-10-28 19:31:23,568[INFO]: Writing sample MAF: TCGA-D3-A3C7-06A-11D-A196-08.7d62b913-4ae8-4c59-9819-71711f12b3b2.maf.txt
2017-10-28 19:31:23,591[INFO]: Writing sample MAF: TCGA-EE-A3J8-06A-11D-A20D-08.7d62b913-4ae8-4c59-9819-71711f12b3b2.maf.txt
2017-10-28 19:31:23,616[WARNING]: Unrecognized data: {
    "file_name": "TCGA.SKCM.somaticsniper.8cce7734-539b-4fba-bf9a-69735906d962.DR-7.0.somatic.maf.gz",
    "data_category": "Simple Nucleotide Variation",
    "data_type": "Masked Somatic Mutation",
    "file_id": "8cce7734-539b-4fba-bf9a-69735906d962"
}
2017-10-28 19:31:23,616[WARNING]: Unrecognized data: {
    "file_name": "TCGA.SKCM.muse.a1fe3943-5377-4763-8494-5e4e61545820.DR-7.0.somatic.maf.gz",
    "data_category": "Simple Nucleotide Variation",
    "data_type": "Masked Somatic Mutation",
    "file_id": "a1fe3943-5377-4763-8494-5e4e61545820"
}
2017-10-28 19:31:23,617[WARNING]: Unrecognized data: {
    "file_name": "TCGA.SKCM.varscan.e751c317-d661-4290-b755-2b5c4d9cd0a4.DR-7.0.somatic.maf.gz",
    "data_category": "Simple Nucleotide Variation",
    "data_type": "Masked Somatic Mutation",
    "file_id": "e751c317-d661-4290-b755-2b5c4d9cd0a4"
}
...

gsaksena commented 6 years ago

Note that all of the current regression tests pass even though this error is thrown. I'm not sure whether that is desirable.

dheiman commented 6 years ago

This was by choice and design; we are currently only using mutect mafs.

These are warnings, not errors. Data types that are not specified are not diced, and a warning is given for them in the log.
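A rough sketch of that behavior, for illustration only. This is not the gdctools implementation: the `RECOGNIZED` table, the `classify` function, the `maf_mutect` label, and the file-name matching rule are all hypothetical. Only the metadata field names and the "Unrecognized data" warning text are taken from the log above.

```python
import json
import logging

logger = logging.getLogger("dicer_sketch")

# Hypothetical lookup table: (data_category, data_type, caller) -> datatype label.
# Only mutect is listed, mirroring "we are currently only using mutect mafs".
RECOGNIZED = {
    ("Simple Nucleotide Variation", "Masked Somatic Mutation", "mutect"): "maf_mutect",
}


def classify(file_meta):
    """Return a datatype label for a recognized file.

    For anything not in the table, log a warning and return None so the
    caller can skip dicing that file.
    """
    for (category, data_type, caller), label in RECOGNIZED.items():
        if (file_meta["data_category"] == category
                and file_meta["data_type"] == data_type
                and "." + caller + "." in file_meta["file_name"]):
            return label
    logger.warning("Unrecognized data: %s", json.dumps(file_meta, indent=4))
    return None
```

Under these assumptions, the muse file from the log would produce a warning and be skipped, while dicing of the remaining files would continue.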

gsaksena commented 6 years ago

In the current branch code, a file dicing can be labeled pass, error, cached, or dry_run, and missing datatypes are being flagged as error. I've also added the new datatypes in this commit:

https://github.com/broadinstitute/gdctools/pull/67/commits/613c77c311f9484eb0bc9d09b9d8b7d95e4c9f0c

It sounds like you would suggest backing out this commit, and adding a new file dicing status of 'unrecognized'.
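To make the second option concrete, here is a minimal sketch of what a separate 'unrecognized' status might look like. The `DiceStatus` enum and the `summarize` helper are hypothetical names, not existing gdctools code; the first four status values are the ones listed above.

```python
from collections import Counter
from enum import Enum


class DiceStatus(Enum):
    PASS = "pass"
    ERROR = "error"
    CACHED = "cached"
    DRY_RUN = "dry_run"
    UNRECOGNIZED = "unrecognized"  # proposed addition, not an existing status


def summarize(statuses):
    """Count statuses and report whether the run should be treated as failed.

    Only ERROR marks the run as failed; UNRECOGNIZED is counted separately
    so it still shows up in the summary without failing anything.
    """
    counts = Counter(statuses)
    failed = counts[DiceStatus.ERROR] > 0
    return counts, failed
```

With a split like this, files of unconfigured datatypes would stay visible in a run summary, and a regression test could assert on the unrecognized count explicitly instead of passing silently.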