KolmogorovLab / hapdup

Pipeline to convert a haploid assembly into diploid
Other
90 stars 10 forks source link

hapdup fails with multiple primary alignments #25

Closed mattloose closed 1 year ago

mattloose commented 1 year ago

I've got one sample in a series which is failing in hapdup with the following error.

Any suggestions?

Thanks

`

Starting merge Expected three tokens in header line, got 2 This usually means you have multiple primary alignments with the same read ID. You can identify whether this is the case with this command:

    samtools view -F 0x904 YOUR.bam | cut -f 1 | sort | uniq -c | awk '$1 > 1'

Expected three tokens in header line, got 2 This usually means you have multiple primary alignments with the same read ID. You can identify whether this is the case with this command:

    samtools view -F 0x904 YOUR.bam | cut -f 1 | sort | uniq -c | awk '$1 > 1'

[2022-12-13 17:15:57] ERROR: Missing output: hapdup/margin/MARGIN_PHASED.haplotagged.bam Traceback (most recent call last): File "/data/test_data/GIT/hapdup/hapdup.py", line 24, in sys.exit(main()) File "/data/test_data/GIT/hapdup/hapdup/main.py", line 206, in main file_check(haplotagged_bam) File "/data/test_data/GIT/hapdup/hapdup/main.py", line 114, in file_check raise Exception("Missing output") Exception: Missing output `

mikolmogorov commented 1 year ago

Never seen this before, but looks like the error might be inside PEPPER/Margin packages. Do you mind sending hapdup.log and full terminal output if you happen to have it? Is there anything special about this failing sample, compared to others (unusual coverage, read length, heterozygosity etc)?

mattloose commented 1 year ago

I don't believe there to be anything particularly unusual about the sample, but it does have some ultra long read data in it.

The log file is below.

I don't have the full terminal output but here is where it was at before the error:

Phasing 87% complete (8087/9279, 7m 45s). Estimated time remaining: 1m 9s > Phasing 88% complete (8180/9279, 7m 49s). Estimated time remaining: 1m 3s > Phasing 89% complete (8274/9279, 7m 53s). Estimated time remaining: 58s > Phasing 90% complete (8363/9279, 7m 57s). Estimated time remaining: 53s > Phasing 91% complete (8494/9279, 8m 3s). Estimated time remaining: 47s > Phasing 92% complete (8550/9279, 8m 6s). Estimated time remaining: 42s > Phasing 93% complete (8666/9279, 8m 11s). Estimated time remaining: 36s > Phasing 94% complete (8775/9279, 8m 15s). Estimated time remaining: 31s > Phasing 95% complete (8864/9279, 8m 18s). Estimated time remaining: 26s > Phasing 96% complete (8908/9279, 8m 20s). Estimated time remaining: 20s > Phasing 97% complete (9009/9279, 8m 23s). Estimated time remaining: 15s > Phasing 98% complete (9136/9279, 8m 26s). Estimated time remaining: 10s > Phasing 99% complete (9190/9279, 8m 26s). Estimated time remaining: 5s > Starting merge
Expected three tokens in header line, got 2
This usually means you have multiple primary alignments with the same read ID. You can identify whether this is the case with this command:

hapdup.log

mikolmogorov commented 1 year ago

Thanks. According to this, it might have something to do with the duplicated read ids: https://github.com/kishwarshafin/pepper/issues/67

Could you try command line: samtools view -F 0x900 -q 10 your.bam | cut -f1 | uniq -c | awk '$1 > 1' | head? If the output is non-empty, there are duplicates. If so, they need to be removed, I don't think there is another workaround in Margin.

mattloose commented 1 year ago

Apologies - I've realised this was an error on our side where duplicated reads had got into the assembly pipeline.

Thanks for your help.