cBioPortal / datahub

A centralized location for storing curated data from cBioPortal

NA in REFERENCE_ALLELE or TUMOR_SEQ_ALLELE #465

Closed jjgao closed 4 years ago

jjgao commented 5 years ago

There are many studies with NA in either REFERENCE_ALLELE or TUMOR_SEQ_ALLELE. Our annotator seems to be able to handle NA in REFERENCE_ALLELE but not in TUMOR_SEQ_ALLELE.

We should fix them. If NA means - (an insertion or deletion), let's follow the MAF spec and use -. @sandertan, is there a rule in the validator to enforce this?

SELECT DISTINCT gp.stable_id
FROM mutation_event me
JOIN mutation m ON me.mutation_event_id = m.mutation_event_id
JOIN genetic_profile gp ON m.genetic_profile_id = gp.genetic_profile_id
WHERE MUTATION_TYPE != 'FUSION'
  AND (REFERENCE_ALLELE = 'NA' OR TUMOR_SEQ_ALLELE = 'NA');

@ritikakundra: please also run the SQL against the private databases and fix the data.
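For the data fix itself, something along these lines might work as a starting point. It is only a minimal sketch, assuming a tab-separated MAF file with a single header row and the standard MAF allele columns (Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2). The helper name `replace_na_alleles` is hypothetical, and the rewrite is only appropriate for studies where NA really does stand for the empty side of an insertion or deletion.

```python
import io

# Allele column names from the MAF format; which of these a given study
# file actually carries is an assumption of this sketch.
ALLELE_COLUMNS = ("Reference_Allele", "Tumor_Seq_Allele1", "Tumor_Seq_Allele2")


def replace_na_alleles(maf_text, placeholder="NA"):
    """Return (new_text, n_replaced) with placeholder alleles rewritten to '-'.

    Hypothetical helper: only use it where NA is known to mean the empty
    side of an insertion or deletion.
    """
    out = io.StringIO()
    header = None
    replaced = 0
    for line in maf_text.splitlines():
        # Pass comment lines (e.g. '#version 2.4') and blank lines through.
        if line.startswith("#") or not line:
            out.write(line + "\n")
            continue
        fields = line.split("\t")
        if header is None:
            # The first non-comment line is taken to be the header row.
            header = fields
            out.write(line + "\n")
            continue
        for i, name in enumerate(header):
            if name in ALLELE_COLUMNS and i < len(fields) and fields[i] == placeholder:
                fields[i] = "-"
                replaced += 1
        out.write("\t".join(fields) + "\n")
    return out.getvalue(), replaced
```

The function returns the number of replaced values so the change can be reviewed per study before committing the rewritten file.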

ritikakundra commented 5 years ago

@jjgao I checked the private instance, and the majority of these studies are not part of it. Rebuilding the public database will resolve almost all of them. We have a few from the private portal that we can look into. I will add the final list here.

sandertan commented 5 years ago

@jjgao I think the --strict_maf_checks mode in the validator checks this. @dionnezaal added some nice validation tests that follow the documented MAF format. Currently, this mode is not enabled when a PR with new study data is validated or during the weekly check. I can't remember why we disabled it, but if this format is a priority, we can enable it again. Perhaps it's best to first run validateStudies.py with --strict_maf_checks on all studies locally to see what warnings/errors we get from it.

Btw, NCI GDC is updating their MAF documentation soon! I've already received a Word doc with what they propose to publish.

jjgao commented 5 years ago

Thanks, @sandertan!

yichaoS commented 4 years ago

@jjgao @ritikakundra Updated list from public DB: acc_2019_mutations, brca_igr_2015_mutations, crc_msk_2017_mutations, ov_tcga_pan_can_atlas_2018_mutations, pediatric_dkfz_2017_mutations (is being fixed this week)

sbabyanusha commented 4 years ago

@jjgao @ritikakundra We are dealing with this issue, and it is associated with the mutated fix #114. It has been resolved for most of the studies.

yichaoS commented 4 years ago

There are only 3 public studies having this 'NA' issue:

yichaoS commented 4 years ago

@jjgao Our validator currently allows alleles to be blank/NA, and the importers still import these variants.
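To make the gap concrete, a stricter per-value check could look roughly like the sketch below. This is not the validator's actual code. The function name `check_allele` is hypothetical, and the rule it applies (an allele is either `-` or a non-empty string of A, C, G, T) is an assumption about what a strict check should accept.

```python
import re

# Assumption: a valid allele is '-' (the empty side of an indel) or a
# non-empty run of the four nucleotide letters. 'N' is deliberately left
# out of the character class so that 'NA' cannot pass as a two-base allele.
_ALLELE_RE = re.compile(r"^(-|[ACGT]+)$")


def check_allele(value):
    """Return None if the allele looks valid, else a short error message."""
    if value is None or value.strip() == "":
        return "allele is blank"
    cleaned = value.strip().upper()
    if cleaned == "NA":
        return "allele is NA; use '-' for an insertion or deletion"
    if not _ALLELE_RE.match(cleaned):
        return "allele contains characters other than A, C, G, T"
    return None
```

Returning a message instead of raising would let a caller decide whether a given problem should be reported as a warning or as an error.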