fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
315 stars 67 forks source link

HapCutToVcf error: issues with BLOCK separators #541

Closed mdem7705 closed 4 years ago

mdem7705 commented 5 years ago

Hey guys,

I am getting this error from my HapCUT2 output:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Did not parse at least 11 fields (parsed 1) in the HapCut2 line: ********
    at scala.Predef$.require(Predef.scala:281)
    at com.fulcrumgenomics.vcf.HapCutCall$.toHapCutCall(HapCutToVcf.scala:616)
    at com.fulcrumgenomics.vcf.HapCutReader.forNextBlockCalls(HapCutToVcf.scala:766)
    at com.fulcrumgenomics.vcf.HapCutReader.advance(HapCutToVcf.scala:780)
    at com.fulcrumgenomics.vcf.HapCutReader.hasNext(HapCutToVcf.scala:725)
    at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.next(HapCutToVcf.scala:318)
    at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.next(HapCutToVcf.scala:299)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.foreach(HapCutToVcf.scala:299)
    at com.fulcrumgenomics.vcf.HapCutToVcf.execute(HapCutToVcf.scala:139)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:99)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:80)
    at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:48)
    at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)

It seems like there is an issue processing the delimiter line between my blocks, viewed here in a section of my hapCUT2 output file:

164     1       0       Sc0000001       54855   C       A       0/1:68,13:81:99:118,0,1982      0       100.00  100.00  93
165     1       0       Sc0000001       54887   G       T       0/1:58,8:66:54:54,0,1750        0       100.00  100.00  86
169     1       0       Sc0000001       54947   C       T       0/1:30,19:49:99:435,0,761       0       100.00  100.00  88
******** 
BLOCK: offset: 174 len: 132772 phased: 113213 SPAN: 6711882 fragments 118552
174     1       0       Sc0000001       88635   G       A       0/1:3,7:10:48:146,0,48  0       100.00  100.00  98
175     1       0       Sc0000001       88697   C       T       0/1:3,4:7:57:83,0,57    0       100.00  100.00  103
176     1       0       Sc0000001       88786   T       G       0/1:1,2:3:17:81,0,17    0       100.00  100.00  91

I could easily edit out the problem line, but I think this may be a symptom of a larger problem so I'd rather check first.

Here is the code I am trying to run:

module load hapcut/2-1.1

extractHAIRS --pacbio 1 --bam Sc0000001.bam --ref Sc0000001.fa --VCF Sc0000001.vcf --out Sc0000001.frags
HAPCUT2 --ea 1 --fragments Sc0000001.frags --VCF Sc0000001.vcf --output Sc0000001.blocks

python3.7 /usr/local/hapcut/2/utilities/prune_haplotype.py -I Sc0000001.blocks -o Sc0000001.pruned --min_mismatch_qual 30 --min_switch_qual 30

java -jar /usr/local/fgbio/0.9.0-snapshot/fgbio-0.9.0-snapshot.jar HapCutToVcf -v  Sc0000001.vcf -I Sc0000001.pruned -o Sc0000001_phased.vcf 

Here are my fgbio input files: hc2vcf.zip And the version of HapCUT I am using is here: https://github.com/vibansal/HapCUT2/releases/tag/5a6f5df

Can you please help?

Thank you!

nh13 commented 5 years ago

@mdem7705 I think the prune_haplotype.py script is causing the issue, since it is no longer in the format we expect. For example this block expects 113,213 phased variants out of 132,772:

BLOCK: offset: 174 len: 132772 phased: 113213 SPAN: 6711882 fragments 118552

but there are only 88,999 variants (lines) after this block definition. You'll see things like:

********
BLOCK: (from split)

Ideally, the script would output in the same format as the inputs. Can you contact the HapCut2 authors (ex. @pjedge) to see if they can modify the script? I'd rather not deal with yet another format.

nh13 commented 4 years ago

Closing as this is an issue in Hapcut's script