arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

is_somatic column is incorrectly flagged as True for all VarScan somatic produced output #260

Closed lbeltrame closed 10 years ago

lbeltrame commented 10 years ago

Unlike MuTect or other software, VarScan (when doing somatic mutation calling) adds an SS (Somatic Status) field to the VCFs it produces, using this code:

To complicate matters, the value is a number, but the type declared for it in the INFO field is "String".

However, in gemini all of these variants are flagged with is_somatic=1, potentially causing false positives. Ideally, gemini should set is_somatic only for variants with SS=2. I understand this needs to be fixed at the VarScan level, but given the pace of development I'm doubtful it will happen any time soon.
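To make the request concrete, here is a rough sketch of the check I have in mind. It is not gemini's code: `parse_info` and `is_somatic_varscan` are hypothetical names, and the sketch assumes the raw INFO column is available as a plain string.

```python
def parse_info(info_field):
    """Split a raw VCF INFO string into a dict; bare flags map to True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def is_somatic_varscan(info_field):
    """Hypothetical helper (not part of gemini): True only when SS is 2."""
    # SS is declared as a String in the header, so compare against the
    # text "2" rather than the integer 2.
    return parse_info(info_field).get("SS") == "2"


print(is_somatic_varscan("DP=45;SOMATIC;SS=2"))
print(is_somatic_varscan("DP=45;SOMATIC;SS=1"))
```

With a check like this, a record that carries the SOMATIC flag but has a somatic status other than 2 would not be marked as somatic.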

arq5x commented 10 years ago

I see where you are coming from, but I am a bit hesitant to go deeper down the road of supporting each variant caller's use of VCF. That said, the is_somatic column is intended to capture the variant caller's prediction and there is no standard (that I know of) for representing this in the INFO column. Would you be interested in adding this functionality and providing a pull request?

lbeltrame commented 10 years ago

Likely yes, that's why I asked. ;) Where should I look to see how gemini does somatic parsing?

arq5x commented 10 years ago

Absolutely. The logic is here in the infotag.py file.

https://github.com/arq5x/gemini/blob/master/gemini/infotag.py#L65-L69

ckandoth commented 10 years ago

Most somatic variant callers, including MuTect and VarScan, follow the TCGA/ICGC standard of inserting a tag named SOMATIC in the VCF's INFO column. Gemini's infotag.py appears to be handling this correctly, so this issue can be closed.

lbeltrame commented 10 years ago

> Most somatic variant callers, including MuTect and VarScan, follow the

VarScan has some issues with flagging the content correctly (I also get records with somatic status != 2). However, this can indeed be closed, as I'll handle it at the point of data generation (inserting SOMATIC flags where necessary).
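For anyone hitting the same problem, a minimal sketch of that kind of clean-up could look like the following. It is only an illustration under assumptions: `fix_somatic_flag` is a hypothetical name, the input is assumed to be tab-separated VCF lines with INFO in the eighth column, and SOMATIC is kept only when SS=2.

```python
def fix_somatic_flag(vcf_line):
    """Keep the SOMATIC flag only on records whose INFO carries SS=2.

    Hypothetical helper: assumes a tab-separated VCF line with at least
    eight columns, INFO being the eighth.
    """
    if vcf_line.startswith("#"):
        return vcf_line  # header lines pass through untouched
    fields = vcf_line.rstrip("\n").split("\t")
    entries = [e for e in fields[7].split(";") if e and e != "."]
    somatic = "SS=2" in entries
    # Drop any existing flag, then re-add it only for SS=2 records.
    entries = [e for e in entries if e != "SOMATIC"]
    if somatic:
        entries.append("SOMATIC")
    fields[7] = ";".join(entries) or "."
    return "\t".join(fields) + "\n"


record = "chr1\t100\t.\tA\tT\t.\tPASS\tDP=30;SOMATIC;SS=1\n"
print(fix_somatic_flag(record), end="")
```

Running every line of the VarScan output through a filter like this before loading should leave gemini's SOMATIC-based logic with nothing to misinterpret.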

arq5x commented 10 years ago

Thanks for the update. Closing.