Kingsford-Group / kourami

Kourami: Graph-guided assembly for HLA alleles
BSD 3-Clause "New" or "Revised" License
35 stars 12 forks source link

java.lang.NullPointerException during Bubble Processing #23

Open ariane-lozachmeur opened 5 years ago

ariane-lozachmeur commented 5 years ago

Hi,

I'm was running Kourami on a dozens of samples and for a few of them, I get this java error:

Bubble Processing and Path Assembly for:        A
Bubble Processing and Path Assembly for:        B
Bubble Processing and Path Assembly for:        C
Bubble Processing and Path Assembly for:        DQA1
Bubble Processing and Path Assembly for:        DQB1
java.lang.NullPointerException
        at Bubble.removeUnsupported(Bubble.java:842)
        at Bubble.<init>(Bubble.java:379)
        at Bubble.<init>(Bubble.java:384)
        at HLAGraph.countBubbles(HLAGraph.java:2050)
        at HLAGraph.countBubblesAndMerge(HLAGraph.java:900)
        at HLA.countBubblesAndMerge(HLA.java:357)
        at HLA.main(HLA.java:585)

In this case it happened during the Bubble Processing of DQB1 but it sometimes happens for other genes.

I should specify that I changed slightly the preprocessing scripts to work on the hg19 reference and include unmapped reads. Any idea of what might cause that? Thank you for your help and amazing work on Kourami!

heewookl commented 5 years ago

Hi,

Could you please explain what you have changed in the preprocessing scripts? My guess you changed the loci information for extracting HLA alleles from hg19? If you can share what you exactly changed, that will be helpful.

Also, could you please share the bam file aligned to the kourami panel? Thank you!

ariane-lozachmeur commented 5 years ago

You guessed correctly, I changed the loci information to extract reads from hg19. In addition, our testing showed better performances when we added back the unmapped reads to the extracted reads.

The resulting code is the following (for the extraction part of the preprocessing)

echo ">>>>>>>>>>>>>>>> extracting reads mapping to HLA loci and ALT contigs"
$samtools_bin view -b -o hla.bam $bam_path 6:29900000-29914000 6:31230000-31241000 6:31320000-31326000 \
                                6:33032000-33042000 6:33043000-33058000 6:32604000-32612000 \
                                6:32629000-32634500 6:32407550-32413000 \
                                6:32520000-32558000
$samtools_bin sort -n -@8 -o hla.sorted.bam hla.bam
$samtools_bin view -b -f4 -@8 -o unmapped.bam $bam_path
$samtools_bin sort -n -@8 -o unmapped.sorted.bam unmapped.bam
$samtools_bin merge -p -f all.bam hla.sorted.bam unmapped.sorted.bam
$samtools_bin sort --thread $num_processors -m $samtools_sort_memory_per_thread -O BAM -o $sampleid.extract.bam all.bam

The rest of the script is the same.

Unfortunately, I am not sure I will be able to send you the file. I am waiting for approval and I'll get back to you when I know more. In the meantime, here is what I tried to do to diagnose the issue, maybe that will help.

To try and see where the NullPointerException was coming from I made a little experiment. Say the Bubble Processing failed during assembly for DQA1, I isolated the reads mapped to DQA1 from the bam file aligned to the Kourami panel (using samtools view -o new.bam in.bam DQA1*01:01:01:01:0-100000 [...] DQA1*06:02:0-100000) Then I downsampled this bam by 10% (samtools view -s0.9 -o new.0.9.bam new.bam) Depending on the seed I used in samtools, those smaller bam file sometimes worked with Kourami and sometimes didn't. At that point I figured that some specific reads were probably causing the issue. So I identified which reads were in the files that didn't work and not in the files that worked. Removing those reads from the original on_KouramiPanel.bam file didn't solve the problem. Moreover, a bam file containing only those suspicious reads didn't cause the bug. Now my guess is that it is probably a certain subset of reads that together create this edge case.

Lastly, the end of the .log file looks like that:

=========================
=  DQB1
=========================
[k] = 2113
trying new k: [k] = 2112
BASE:   [A,2113]
BASE:   [C,2113]
BASE:   [G,2113]
BASE:   [T,2113]
trying new k: [k] = 2111
BASE:   [A,2112]
BASE:   [C,2112]
BASE:   [G,2112]
[...]
trying new k: [k] = 2065
BASE:   [G,2066]
BASE:   [T,2066]
trying new k: [k] = 2064
BASE:   [G,2065]
Found the new start!
Setting Trimming length(header):        49

I'm sorry I can't give you the bam file to reproduce this yet. Please let me know if you have any ideas or things you would like me to test. Thank you!

heewookl commented 5 years ago

Thank you for providing me with much details.

I am guessing you updated those coordinates from hg19 annotation for HLA genes for the extraction part. Including unmapped reads can certainly improve the results especially when the initial mapping is done on hg19, however, sometimes spurious mappings can creep in which can lead to mistyping.

I have a few questions:

1) Are all alignments performed by bwa? 2) Is this WGS? and what are the coverages for these samples?

I understand that you can't send me the extracted bam. In the meantime, I will think of what test you can run on your end to get this problem sorted out. I shall be in touch soon. Thank you.

ariane-lozachmeur commented 5 years ago
  1. The first alignment (to hg19) is performed by SNAP. But during the preprocessing for Kourami I only use bwa.
  2. This is exome sequencing on around 600 genes with a 100x coverage. By adding back the unmapped reads, I get around 500k reads mapped to the Kourami reference.
heewookl commented 5 years ago

Hi, I would like to follow up on this issue and would it be possible for you to drop me an email to arrange a secure data transfer? Thanks.

Rashesh7 commented 5 years ago

Hi, Sorry for jumping in, but I am also interested in editing the preprocessing scripts to work on the hg19 reference.

Please do let me know if this was resolved and there is a consensus on what reads should it extract.

Thanks.

NicolaLady commented 4 years ago

Hi,

I am having the same issue as the original poster. Any updates on how to resolve this?

I am working with whole exome sequencing. Could the issue be due to possibly low coverage at DQB1?

Issue:

----------------REF GRAPH CONSTRUCTION-------------- ---------------- READ LOADING -------------- ---------------- GRAPH CLEANING -------------- Bubble Processing and Path Assembly for: A Bubble Processing and Path Assembly for: B Bubble Processing and Path Assembly for: C Bubble Processing and Path Assembly for: DQA1 Bubble Processing and Path Assembly for: DQB1 java.lang.NullPointerException at Bubble.removeUnsupported(Bubble.java:842) at Bubble.(Bubble.java:379) at Bubble.(Bubble.java:384) at HLAGraph.countBubbles(HLAGraph.java:2040) at HLAGraph.countBubblesAndMerge(HLAGraph.java:890) at HLA.countBubblesAndMerge(HLA.java:357) at HLA.main(HLA.java:585)

The tail end of my log file


DQA1

[k] = 4757 trying new k: [k] = 4756 BASE: [G,4757] Found the new start! Setting Trimming length(header): 1 NumBubbles: 36 found

<----------------------------------> CHECKING INTER-SUPERBUBBLE PHASING: <---------------------------------->

Printing 1 fractured super bubbles. SUPER BUBBLE [0] FractureEndIndex:[ 0 ] IntersectionScore: 27.701600143119084 -8.23921481214836

candidate_0-0 CTGACCACGTTGCCTCTTGTGGTGTAAACTTGTACCAGTTTTACGGTCCCTCTGGCCAGTACACCCATGAATTTGATGGAGATGAGGAGTTCTACGTGGACC TGGAGAGGAAGGAGACTGCCTGGCGGTGGCCTGAGTTCAGCAAATTTGGAGGTTTTGACCCGCAGGGTGCACTGAGAAACATGGCTGTGGCAAAACACAACT TGAACATCATGATTAAACGCTACAACTCTACCGCTGCTACCAATG

FractureEndIndex:[ 0 ] IntersectionScore: 8.298399856880916 -52.64326228050601

candidate_0-1 CTGACCACGTCGCCTCTTATGGTGTAAACTTGTACCAGTCTTACGGTCCCTCTGGCCAGTACACCCATGAATTTGATGGAGATGAGCAGTTCTACGTGGACC TGGGGAGGAAGGAGACTGTCTGGTGTTTGCCTGTTCTCAGACAATTTAGATTTGACCCGCAATTTGCACTGACAAACATCGCTGTCCTAAAACATAACTTGA ACAGTCTGATTAAACGCTCCAACTCTACCGCTGCTACCAATG

AllelePair [0:0] {PAIRSCORE:-65.01606974352967 E_SUM:63724.0 MAXFLOW:59.0} AllelePair [0:1] {PAIRSCORE:-62.28610893381102 E_SUM:66484.0 MAXFLOW:65.0} AllelePair [1:1] {PAIRSCORE:-153.82416468024502 E_SUM:54037.0 MAXFLOW:17.0}

BEST PAIR[0:1]: -62.28610893381102 [0 0.7914742898034024 -8.23921481214836]BEST MATCH: DQA101:01:01G 249 1.0 [MaxFl ow:65.0] [MinDepth1:33.0] [MinDepth2:17.0] [1 0.23709713876802618 -52.64326228050601]BEST MATCH: DQA105:01:01G 246 1.0 [MaxFl ow:65.0] [MinDepth1:33.0] [MinDepth2:17.0]


DQB1

[k] = 2113 trying new k: [k] = 2112 BASE: [A,2113] BASE: [G,2113] BASE: [N,2113] trying new k: [k] = 2111 BASE: [A,2112] BASE: [N,2112] trying new k: [k] = 2110 BASE: [C,2111] BASE: [N,2111] trying new k: [k] = 2109 BASE: [G,2110] BASE: [N,2110] trying new k: [k] = 2108 BASE: [C,2109] BASE: [G,2109] BASE: [T,2109] trying new k: [k] = 2107 BASE: [A,2108] BASE: [C,2108] BASE: [T,2108] trying new k: [k] = 2106 BASE: [C,2107] BASE: [G,2107] trying new k: [k] = 2105 BASE: [A,2106] BASE: [C,2106] BASE: [T,2106] BASE: [N,2106] trying new k: [k] = 2104 BASE: [G,2105] BASE: [T,2105] BASE: [N,2105] trying new k: [k] = 2103 BASE: [A,2104] BASE: [T,2104] BASE: [N,2104] trying new k: [k] = 2102 BASE: [A,2103] BASE: [C,2103] BASE: [T,2103] trying new k: [k] = 2101 BASE: [G,2102] BASE: [N,2102] trying new k: [k] = 2100 BASE: [A,2101] BASE: [C,2101] BASE: [G,2101] BASE: [T,2101] BASE: [N,2101] trying new k: [k] = 2099 BASE: [A,2100] BASE: [G,2100] BASE: [T,2100] trying new k: [k] = 2098 BASE: [C,2099] BASE: [G,2099] BASE: [N,2099] trying new k: [k] = 2097 BASE: [C,2098] BASE: [G,2098] BASE: [N,2098] trying new k: [k] = 2096 BASE: [A,2097] BASE: [C,2097] BASE: [G,2097] BASE: [N,2097] trying new k: [k] = 2095 BASE: [C,2096] BASE: [G,2096] BASE: [T,2096] BASE: [N,2096] trying new k: [k] = 2094 BASE: [A,2095] BASE: [G,2095] BASE: [T,2095] trying new k: [k] = 2093 BASE: [C,2094] BASE: [T,2094] trying new k: [k] = 2092 BASE: [C,2093] Found the new start! Setting Trimming length(header): 21

Kind regards,

Nicola

nikolasthuesen commented 3 years ago

I got the same error, when running Kourami on extracted WES data from the 1000 genomes samples NA19346 and NA19471. These were the only two giving this error out of the 829 samples I ran the workflow for. The errors were not caused by using hg19 reference - I was using GRCh38. The error happened during the typing of DQB1 as well, meaning that Kourami had already typed A, B, C and DQA1. As the error happened, both samples had a missing ".result" file and the predictions, that it did manage to do before the error, could therefore not be found here. However, the typing results of A, B, C and DQA1 were still present in the ".log" file. If anyone is still having this problem, and like me don't have time to wait for the fix, the predictions for A, B, C and DQA1 can still be retrieved.

masen1991 commented 3 years ago

Can Kourami work on hg19 ref samples? and by the way,how can i use Kourami on WES samples (maybe hg19 ref or hg38 ref)?