cytham / nanovar

Structural variant caller for low-depth long-read sequencing data
GNU General Public License v3.0

Nanovar fails with Python KeyError when clustering breakends #19

Closed oneillkza closed 3 years ago

oneillkza commented 3 years ago

Similar to #18, I'm running NanoVar from the BioContainers container, this time on the full GM24385 data (one flowcell of PromethION WGS at around 30X coverage).

This is the output from nohup (capturing stdout/stderr):

[23/02/2021 22:00:22] - NanoVar started

Traceback (most recent call last):
  File "/usr/local/bin/nanovar", line 479, in <module>
    main()
  File "/usr/local/bin/nanovar", line 307, in main
    run.cluster_extract()
  File "/usr/local/lib/python3.8/site-packages/nanovar/nv_characterize.py", line 94, in cluster_extract
    cluster_out, self.seed2 = sv_cluster(self.total_subdata, self.total_out, self.buff, self.maxovl, self.mincov,
  File "/usr/local/lib/python3.8/site-packages/nanovar/nv_cluster.py", line 49, in sv_cluster
    readteam, infodict, classdict, mainclass, svsizedict = rangecollect(parse,
  File "/usr/local/lib/python3.8/site-packages/nanovar/nv_cluster.py", line 104, in rangecollect
    rightclust[chm1][rnameidx + '-l'] = le
KeyError: '1'

This is the Nanovar log:

[23/02/2021 22:00:22] - INFO - Initialize NanoVar log file
[23/02/2021 22:00:22] - INFO - Version: NanoVar-1.3.8
[23/02/2021 22:00:22] - INFO - Command: /usr/local/bin/nanovar -t 24 PAG33026.bam hg38_no_alt.fa nanovar_tmp
[23/02/2021 22:00:23] - INFO - Input file: PAG33026.bam
[23/02/2021 22:00:23] - INFO - Read type: ont
[23/02/2021 22:00:23] - INFO - Reference genome: hg38_no_alt.fa
[23/02/2021 22:00:23] - INFO - Working directory: nanovar_tmp
[23/02/2021 22:00:23] - INFO - Model: /usr/local/lib/python3.8/site-packages/nanovar/model/ANN.E100B400L3N12-5D0.4-0.2SGDsee11_het_gup_v1.h5
[23/02/2021 22:00:23] - INFO - Filter file: None
[23/02/2021 22:00:23] - INFO - Minimum number of reads for calling a breakend: 2
[23/02/2021 22:00:23] - INFO - Minimum SV len: 25
[23/02/2021 22:00:23] - INFO - Mapping percent for split-read: 0.05
[23/02/2021 22:00:23] - INFO - Length buffer for clustering: 50
[23/02/2021 22:00:23] - INFO - Score threshold: 1.0
[23/02/2021 22:00:23] - INFO - Homozygous read ratio threshold: 0.75
[23/02/2021 22:00:23] - INFO - Heterozygous read ratio threshold: 0.35
[23/02/2021 22:00:23] - INFO - Number of threads: 24

[23/02/2021 22:00:23] - INFO - Total number of reads in FASTQ/FASTA: -

[23/02/2021 22:00:23] - INFO - NanoVar started
[23/02/2021 22:01:09] - INFO - Input BAM file, skipping minimap2 alignment
[23/02/2021 22:01:13] - DEBUG - Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
[23/02/2021 22:01:16] - INFO - Parsing BAM and detecting SVs
[24/02/2021 01:17:21] - INFO - Gap dictionary not loaded.
[24/02/2021 01:56:27] - INFO - Genome size: 3099922541 bases
[24/02/2021 01:56:27] - INFO - Mapped bases: 106823698851 bases
[24/02/2021 01:56:27] - INFO - Depth of coverage: 34.46x
[24/02/2021 01:56:27] - INFO - Read overlap upper limit: 10

[24/02/2021 01:56:27] - INFO - Total number of mapped reads: 9869350

[24/02/2021 01:56:27] - INFO - Clustering SV breakends

cytham commented 3 years ago

Could you please rerun with the --debug option and send me the "genome.sizes" and "parse1.tsv" files? If they are too large to attach here, you can send them to my email, e0054302@u.nus.edu.

oneillkza commented 3 years ago

I've rerun with --debug and put those files up at https://www.bcgsc.ca/downloads/koneill/nanovar_test/

parse1.tsv is about 700MB

cytham commented 3 years ago

Thanks for the files. May I check whether the reference genome you used to generate the BAM file (i.e. PAG33026.bam) is the same as the reference file you used to run NanoVar (i.e. hg38_no_alt.fa)? It looks like a mismatch in chromosome/contig labeling (e.g. hg38_no_alt.fa has "chr1" but your BAM might have "1"). Can you please check this?

If this is not the issue, I would need you to upload the subdata.tsv, detect.tsv, and log files for further investigation.

Thanks for your patience.
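One quick way to test for this kind of labeling mismatch is to compare the contig names in the BAM header with the names in the reference FASTA. Below is a minimal sketch using only the Python standard library. It assumes the header text was obtained separately (for example with `samtools view -H`), the function names are hypothetical, and the inputs are toy data rather than the files from this issue.

```python
def sam_header_contigs(header_text):
    """Collect contig names from the SN: field of each @SQ header line."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def fasta_contigs(fasta_lines):
    """Collect contig names from FASTA header lines (text up to first whitespace)."""
    names = []
    for line in fasta_lines:
        if line.startswith(">") and len(line.strip()) > 1:
            names.append(line[1:].split()[0])
    return names


def compare_contigs(bam_names, ref_names):
    """Return (names only in the BAM, names only in the reference), both sorted."""
    bam_only = sorted(set(bam_names) - set(ref_names))
    ref_only = sorted(set(ref_names) - set(bam_names))
    return bam_only, ref_only


# Toy inputs: a BAM header labeled "1", "2" and a FASTA labeled "chr1", "chr2".
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
)
fasta = [">chr1 toy description\n", "ACGT\n", ">chr2\n", "ACGT\n"]

bam_only, ref_only = compare_contigs(sam_header_contigs(header), fasta_contigs(fasta))
print("Only in BAM:", bam_only)
print("Only in reference:", ref_only)
```

If either list is non-empty, the BAM and the reference probably do not come from the same assembly or naming scheme.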

cytham commented 3 years ago

@oneillkza were you able to resolve the issue?

oneillkza commented 3 years ago

Oh sorry -- I had a reply typed to this and didn't post it. Yes, that was the issue -- I had an hg19 bam but was giving it the hg38 reference. It should be fine to close this issue.

It might be worth putting in a check for this somewhere in NanoVar, since this is an easy mistake to make, and the error message was a little cryptic.

Thanks for looking into it!
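For illustration, a fail-fast check along these lines might look like the sketch below. This is not NanoVar's code: the exception class, the function name, and the idea of passing two name-to-length mappings are all assumptions made for the example. Comparing lengths as well as names also catches the case where a different assembly happens to use the same contig labels.

```python
class ContigMismatchError(ValueError):
    """Hypothetical error raised when BAM and reference contigs disagree."""


def check_contigs(bam_contigs, ref_contigs):
    """Raise a descriptive error if any BAM contig is missing from the
    reference or has a different length. Both arguments map name -> length."""
    missing = [name for name in bam_contigs if name not in ref_contigs]
    wrong_length = [
        name
        for name in bam_contigs
        if name in ref_contigs and bam_contigs[name] != ref_contigs[name]
    ]
    problems = []
    if missing:
        problems.append("BAM contigs absent from the reference: " + ", ".join(missing))
    if wrong_length:
        problems.append("contigs with differing lengths: " + ", ".join(wrong_length))
    if problems:
        raise ContigMismatchError(
            "; ".join(problems)
            + ". Was the BAM aligned to the same reference genome?"
        )


# Toy example: BAM labeled "1" (hg19-style length) against a reference labeled "chr1".
try:
    check_contigs({"1": 249250621}, {"chr1": 248956422})
except ContigMismatchError as err:
    print("Input check failed:", err)
```

Running a check like this before clustering would report the mismatch up front instead of a bare `KeyError: '1'` hours into the run.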