cytham / nanovar

Structural variant caller for low-depth long-read sequencing data
GNU General Public License v3.0
45 stars 10 forks source link

Nanovar segfaults on Neural network inference stage #18

Closed oneillkza closed 3 years ago

oneillkza commented 3 years ago

I'm running from the Biocontainers docker container via Singularity. The data set is a GM24385 test set we sequenced on a PromethION flowcell. I subsambled the first 10Mbp of chr20 for this test. It gets to the "neural network inference" and then segmentation faults.

[23/02/2021 18:25:06] - INFO - Initialize NanoVar log file
[23/02/2021 18:25:06] - INFO - Version: NanoVar-1.3.8
[23/02/2021 18:25:06] - INFO - Command: /usr/local/bin/nanovar results/running_id/bam/minimap_2_2_17_running_id.sorted.bam hg38_no_alt.fa .
[23/02/2021 18:25:06] - INFO - Input file: results/running_id/bam/minimap_2_2_17_running_id.sorted.bam
[23/02/2021 18:25:06] - INFO - Read type: ont
[23/02/2021 18:25:06] - INFO - Reference genome: hg38_no_alt.fa
[23/02/2021 18:25:06] - INFO - Working directory: .
[23/02/2021 18:25:06] - INFO - Model: /usr/local/lib/python3.8/site-packages/nanovar/model/ANN.E100B400L3N12-5D0.4-0.2SGDsee11_het_gup_v1.h5
[23/02/2021 18:25:06] - INFO - Filter file: None
[23/02/2021 18:25:06] - INFO - Minimum number of reads for calling a breakend: 2
[23/02/2021 18:25:06] - INFO - Minimum SV len: 25
[23/02/2021 18:25:06] - INFO - Mapping percent for split-read: 0.05
[23/02/2021 18:25:06] - INFO - Length buffer for clustering: 50
[23/02/2021 18:25:06] - INFO - Score threshold: 1.0
[23/02/2021 18:25:06] - INFO - Homozygous read ratio threshold: 0.75
[23/02/2021 18:25:06] - INFO - Heterozygous read ratio threshold: 0.35
[23/02/2021 18:25:06] - INFO - Number of threads: 1

[23/02/2021 18:25:06] - INFO - Total number of reads in FASTQ/FASTA: -

[23/02/2021 18:25:06] - INFO - NanoVar started
[23/02/2021 18:25:57] - INFO - Input BAM file, skipping minimap2 alignment
[23/02/2021 18:26:10] - DEBUG - Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
[23/02/2021 18:26:21] - INFO - Parsing BAM and detecting SVs
[23/02/2021 18:26:50] - INFO - Gap dictionary not loaded.
[23/02/2021 18:26:55] - INFO - Genome size: 3099922541 bases
[23/02/2021 18:26:55] - INFO - Mapped bases: 358142948 bases
[23/02/2021 18:26:55] - INFO - Depth of coverage: 0.12x
[23/02/2021 18:26:55] - WARNING - Sequencing depth is less than 4x, output may not be comprehensive
[23/02/2021 18:26:55] - INFO - Read overlap upper limit: 10

[23/02/2021 18:26:55] - INFO - Total number of mapped reads: 33238

[23/02/2021 18:26:55] - INFO - Clustering SV breakends
[23/02/2021 18:26:57] - INFO - Filtering INS and INV SVs
[23/02/2021 18:26:58] - DEBUG - Make blast index skipped
[23/02/2021 18:26:58] - INFO - Windowmasker counts
[23/02/2021 18:27:00] - DEBUG - computing the genome length
[23/02/2021 18:27:17] - DEBUG - pass 1
[23/02/2021 18:44:44] - DEBUG - pass 2
[23/02/2021 19:02:13] - INFO - Windowmasker binary
[23/02/2021 19:02:13] - DEBUG - reading counts...
[23/02/2021 19:02:26] - DEBUG - converting counts...
[23/02/2021 19:03:26] - DEBUG - converting parameters...
[23/02/2021 19:03:26] - DEBUG - final processing...
[23/02/2021 19:03:26] - DEBUG - optimizing the data structure
[23/02/2021 19:03:54] - DEBUG - Make hs-blastn FMD-index skipped
[23/02/2021 19:03:54] - DEBUG - [HS-BLASTN] Loading database.
[23/02/2021 19:03:54] - DEBUG - Loading hg38_no_alt.fa.sequence, size = 3GB
[23/02/2021 19:04:25] - DEBUG - Loading hg38_no_alt.fa.bwt, size = 3GB
[23/02/2021 19:04:53] - DEBUG - Loading hg38_no_alt.fa.sa, size = 6GB
[23/02/2021 19:05:50] - DEBUG - [HS-BLASTN] done. Time elapsed: 116.05 secs.
[23/02/2021 19:05:50] - DEBUG - [HS-BLASTN] Processing ./temp2.fa.
[23/02/2021 19:05:50] - DEBUG - Processing 103 queries.
[23/02/2021 19:06:00] - DEBUG - [HS-BLASTN] done. Elpased time: 10.1990 secs.
[23/02/2021 19:06:00] - DEBUG - [HS-BLASTN] 103 queries processed.
[23/02/2021 19:06:03] - DEBUG - [CMD] hs-blastn align -db hg38_no_alt.fa -window_masker_db ./hg38_no_alt.counts.obinary -query ./temp2.fa -out ./temp-hg38_no_alt-blast.tab -outfmt 6 -num_threads 1 -max_target_seqs 3 -gapopen 0 -gapextend 4 -penalty -3 -reward 2
[23/02/2021 19:06:03] - INFO - Gap dictionary not loaded.
[23/02/2021 19:06:03] - INFO - Parsing BAM and detecting INV and INS SVs
[23/02/2021 19:06:03] - INFO - Re-clustering INS/INV SVs and merging
[23/02/2021 19:06:04] - INFO - Neural network inference
cytham commented 3 years ago

Hi, I have not tested on Biocontainers docker container via Singularity before. There might be a memory problem. How much memory did you allocate for this job?

oneillkza commented 3 years ago

I was running this on one of our "general purpose" hosts that has 1.5TB of RAM. There were other users running jobs, but there should have been at least 700-800GB free.

oneillkza commented 3 years ago

Is there some minimal example data I can use to test that the container is working correctly?

cytham commented 3 years ago

Currently, there is no test set, but I'm working on it for future versions. Is this problem still persisting?

oneillkza commented 3 years ago

Hi @cytham!

Yes, unfortunately this problem is still persisting. If you want to try to debug it, I have a minimal-ish bam you can run:

https://www.bcgsc.ca/downloads/koneill/nanovar_test/lra_1_1_2_running_id.sorted.bam https://www.bcgsc.ca/downloads/koneill/nanovar_test/lra_1_1_2_running_id.sorted.bam.bai

It's the first 10Mb of chr20, from some NA24385 data we sequenced internally.

cytham commented 3 years ago

Hi @oneillkza, sorry for the wait. I have released a new version (v1.3.9). I suggest to install the new version in a new environment and try running it again.

cytham commented 3 years ago

@zhemingfan Hi, did you post an issue?

zhemingfan commented 3 years ago

Hi @cytham I did - I wanted to retry something, but it didn't end up working. I'm not sure if this is something on pysam's end - but I was getting

Traceback (most recent call last):
  File "/home/jfan/miniconda3/bin/nanovar", line 494, in <module>
    main()
  File "/home/jfan/miniconda3/bin/nanovar", line 301, in main
    run.bam_parse_detect()
  File "/home/jfan/miniconda3/lib/python3.8/site-packages/nanovar/nv_characterize.py", line 76, in bam_parse_detect
    = bam_parse(self.bam, self.minlen, self.splitpct, self.minalign, self.dir, self.filter, self.contig_omit)
  File "nanovar/nv_bam_parser.pyx", line 67, in nanovar.nv_bam_parser.bam_parse
  File "pysam/libcalignedsegment.pyx", line 2399, in pysam.libcalignedsegment.AlignedSegment.get_tag
  File "pysam/libcalignedsegment.pyx", line 2438, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'AS' not present"

After converting the BAM file to a SAM, adding an AS tag to the SAM file, then re-converting it to a BAM file, the error persists. Do you have any suggestions?

cytham commented 3 years ago

Hi @zhemingfan, may I know how the bam file was generated? Was it through NanoVar? Or did you align it with your own aligner?

oneillkza commented 3 years ago

@cytham we're using minimap2, but would likely also be interested in trying NanoVar with LRA.

Could I please ask what you mean by "through NanoVar"? I was under the impression (from the readme) that NanoVar took an aligned bam as input.

cytham commented 3 years ago

Ok, minimap2 should be fine. Not sure if LRA will be compatible though.

NanoVar can also take FASTA/FASTQ files as input, which it will first do alignment with minimap2.

zhemingfan commented 3 years ago

Hi @cytham , thank you for all your help! I was able to get Nanovar working with minimap2, but after re-doing the alignment, LRA is still giving the same KeyError.

cytham commented 3 years ago

i believe LRA is not compatible with NanoVar. It seems like it is lacking the 'AS' tag in its output SAM file after alignment. Is there any parameters to add the 'AS' tag in LRA output?

When you said you've added the 'AS' tag manually, did you also add the alignment score that comes with the tag? Because this is required in NanoVar.

oneillkza commented 3 years ago

Yeah LRA puts the alignment score in a custom tag "NV" rather than in "AS" as is customary. It looks like their score is also a float rather than an int, which may be why they chose to use a custom tag?

Looking at the values themselves, they seem to be more-or-less in the same range as minimap2 AS values, although maybe about two-fold smaller. Also the floating point seem to have about 4-6 significant digits to their left, so I doubt there'd be any loss of precision by having it be an integer.

One potential solution would be to generate AS tags for an LRA bam from the NV tag but rounded to the nearest integer.

cytham commented 3 years ago

@oneillkza thanks for looking into this matter and raising it up to the authors of LRA. Yes I believe replacing NV with AS might work, though we still have to test its readability of Floats by Pysam, which is what NanoVar uses to read the SAM file. I would hope that the authors of LRA would change its tag labels to follow the current SAM format in order for its SAM files to be readable by existing packages such as pysam.