bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

VarScan's invalid VCF output (somatic caller) causes the pipeline to fail #134

Closed lbeltrame closed 11 years ago

lbeltrame commented 11 years ago

VarScan versions up to 2.3.5 have still bugs that result in improper VCFs being generated (invalid indels, like GA/+A).

For reference: http://sourceforge.net/p/varscan/discussion/1073559/thread/1be8f289/

Some of these bugs are fixed in 2.3.6 considering the GATK may error out when joining VCFs if using an older version, I'd make the latest version mandatory. However, 2.3.6 has still issues when the somatic caller returns variants of type "Unknown", because the REF field is invalid (example: GAA/A).

The plan is to notify upstream (see below) and at the same time workaround the issue until a fixed VarScan version is available.

The most recent issues with VarScan are mentioned in this thread: https://sourceforge.net/p/varscan/discussion/1073559/thread/7f070311/

lbeltrame commented 11 years ago

Actually the issue is not solved completely yet (at least in tumor / normal pairs). I guess some hacks will be needed (currently I can't analyze my dataset using VarScan due to this issue).

chapmanb commented 11 years ago

Luca; The automated installer uses 2.3.6 and I updated the check during running VarScan to ensure the latest is present. If you're still running into issues it would be worth reporting them to Dan, and we can work around it in the meantime. Here is the code used prior to 2.3.5 to fix index problems with non-tumor calling:

https://github.com/chapmanb/bcbio-nextgen/blob/98d33b4ea57bc057dd003e9c942c401cea2c3ee9/bcbio/variation/varscan.py#L44

lbeltrame commented 11 years ago

The problem is that I can't find the problematic file to see if the issue can be worked around, which generates this error (it's always line 19, meaning that's probably a VarScan issue)

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 19: Unparsable vcf record with allele A/C

or similar ones (I also have GA/+A) . They're generated during a transaction and my naive grepping doesn't seem to find them.

It may be worthy to reinstate the fix_indel_line function once I figure out how exactly the files are invalid.

chapmanb commented 11 years ago

Luca; You can look at the last VarScan command issued in log/bcbio-nextgen-commands.log and then manually re-run it from the command line after changing the output to not point to a transactional directory. That should help isolate the issue and let you generate a VCF to test with.

lbeltrame commented 11 years ago

directory. That should help isolate the issue and let you generate a VCF to test with.

I confirm the issue with indels at least (CT/-T in my sample test). I'll open a report for VarScan as well, but in the mean time I guess that at least for tumor/normal analysis we'll need the workarounds thrown back in.

I'll make a PR soon.

lbeltrame commented 11 years ago

Actually it's a bit more complex: VarScan reports "+T/-T" in its non VCF output, and in VCF it is CT/-T (the REF base is C). I have absolutely no idea if it means it is a deletion or an insertion.

EDIT: According to this post (http://seqanswers.com/forums/showpost.php?p=95103&postcount=8) the / should be replaced by "," as it indicates multiple alleles. So it's kind of different than the previous issue. However I'm still baffled on how the second "-T" would represent.

lbeltrame commented 11 years ago

This code (https://bitbucket.org/meltzerlab/seqtools/src/505952611251/seqtools/varscan.py) seems to do the job nicely, but it has no license: I'll contact the author to ensure we can add it.

lbeltrame commented 11 years ago

I think I almost got it to work... but still needs a few adjustments, I think.

lbeltrame commented 11 years ago

After a whole day fighting VarScan's broken output, the only solution I could find was completely discard "Unknown" (where VarScan is unable to identify whether it's LOH, somatic, germline, and it is not reference) variants, like complex adjustments (e.g. insertion in normal, deletion in tumor).

These have completely incorrect REF fields (in the form of CAA/A or GA/+A) and cause the GATK to choke on them. It's also not clear (to me at least) how could I fix them.

I've notified Dan, so hopefully this is a stopgap solution. I'm posting the patch here for discussion (I'm not sure if it warrants incusion, but I had to do this or the pipeline would keep on getting stuck with new errors the moment I tried to patch the existing ones).

diff --git a/bcbio/variation/varscan.py b/bcbio/variation/varscan.py
index 2a38321..d488e6f 100644
--- a/bcbio/variation/varscan.py
+++ b/bcbio/variation/varscan.py
@@ -158,6 +158,8 @@ def _fix_varscan_vcf(orig_file, in_bams):

                     for line in in_handle:
                         line = _fix_varscan_output(line, in_bams)
+                        if not line:
+                            continue
                         out_handle.write(line)

@@ -207,6 +209,8 @@ def _fix_varscan_output(line, in_bams):
     Ifreq = line[8].split(":").index("FREQ")
     ndat = line[9].split(":")
     tdat = line[10].split(":")
+    somatic_status = line[7].split(";")[1]  # SS=<number>
+    somatic_status = int(somatic_status.split("=")[1])  # Get the number
     ndat[Ifreq] = str(float(ndat[Ifreq].rstrip("%")) / 100)
     tdat[Ifreq] = str(float(tdat[Ifreq].rstrip("%")) / 100)
     line[9] = ":".join(ndat)
@@ -215,6 +219,12 @@ def _fix_varscan_output(line, in_bams):
     #FIXME: VarScan also produces invalid REF records (e.g. CAA/A)
     # This is not handled yet.

+    if somatic_status == 5:
+
+        # "Unknown" states are broken in current versions of VarScan
+
+        return
+
     if "+" in ALT or "-" in ALT:
         if "/" not in ALT:
             if ALT[0] == "+":
lbeltrame commented 11 years ago

One more reason to skip these broken lines is that the ALT alleles (usually more than one) have incorrect GT fields (because only one allele is counted while there are 2).

chapmanb commented 11 years ago

Luca -- thanks for all the work digging into this and for letting Dan know. I'd love to have a workaround in for now and we can remove it when VarScan handles these correctly.

lbeltrame commented 11 years ago

Looks like I missed this reply: I'll get a PR done soon (I am forced to run with a variation of that patch or my analyses wouldn't complete). I'll add the reference link in the report so that we can track the status.