cytham / nanovar

Structural variant caller for low-depth long-read sequencing data
GNU General Public License v3.0
45 stars 10 forks source link

TypeError: Error: Genome filter BED contig id not found in reference at line 1 #55

Closed egnst closed 1 year ago

egnst commented 1 year ago

Hi Cheng Yong,

Since the last time I used NanoVar, I've upgraded my version of Anaconda, and I've finally upgraded to using hg38, and now my old script stopped working.

I used the following command:

nanovar -t 24 -f hg38 /usr/fastq_pass/sample.fastq /usr/data/Homo_sapiens.GRCh38.dna.primary_assembly.fa .

Then I get the following error:

[09/12/2022 16:34:01] - NanoVar started Traceback (most recent call last): File "/usr/.local/bin/nanovar", line 557, in main() File "/usr/.local/bin/nanovar", line 266, in main if bed_valid(filter_path, contig_len_dict): File "/usr/.local/lib/python3.9/site-packages/nanovar/nv_valid.py", line 212, in bed_valid raise TypeError("Error: Genome filter BED contig id not found in reference at line %s" % str(c)) TypeError: Error: Genome filter BED contig id not found in reference at line 1

I've tried this using the fastq file and an lra-aligned bam file as input, and get the same error either way. What does this error actually mean? I'm happy for any ideas on what to try troubleshooting first.

cytham commented 1 year ago

Hi @egnst, thank you for using NanoVar.

The error you encountered results from different chromosome/contig ids between the built-in hg38 filter bed file and the reference genome you provided. The hg38 built-in filter file uses ids of 'chr1', 'chr2', etc. Can you please let me know the chromosome/contig ids of your reference file? You can easily see all of them in the '.fai' file of the reference fasta.

If that's really the problem, one easy fix is to remove the '-f hg38' the option. But do note that the output variants in the VCF will not be filtered for known false positives.

egnst commented 1 year ago

Thank you for your explanation. I tried to manually edit all my chromosomes to be "chr1" instead of "1", etc. That made my script run longer before it crashed, but I started getting hs-blastn errors instead. I assumed there was just something fundamentally off about my reference genome, so I tried to download a different reference file from UCSC that uses "chr1" notation.

However, I'm still getting these hs-blastn errors. I've tried multiple reference genomes and multiple versions of hs-blastn. I'm out of ideas, so I'm happy to try any ideas you might have. Here's the errors from the log file:

[13/12/2022 12:21:55] - INFO - Make FMD-index [13/12/2022 12:21:55] - DEBUG - [Tue Dec 13 12:21:55 2022 corelib/line_reader.c:HbnBufferedLineReaderNew:21] fail to open file 'index' with mode 'r': No such file or directory [13/12/2022 12:21:55] - CRITICAL - Error: hs-blastn index failed ... [13/12/2022 12:43:17] - DEBUG - [Tue Dec 13 12:43:17 2022 app/hbnmap/cmdline_args.cpp:ParseHbnProgramCmdLineArguments:831] Too many query and subject values: 'align -db ~/data/hg38.fa -window_masker_db ./hg38.counts.obinary -query ./temp2.fa -out ./temp-hg38-blast.tab -outfmt 6 -num_threads 23 -max_target_seqs 3 -gapopen 0 -gapextend 4 -penalty -3 -reward 2' [13/12/2022 12:43:18] - CRITICAL - Error: hs-blastn alignment failed

cytham commented 1 year ago

@egnst There might be a problem with the hsblastn index. Can you try rerunning with the --force option to recreate the index? Thanks

egnst commented 1 year ago

I gave it a try, and I'm still getting the same thing:

[13/12/2022 14:27:03] - DEBUG - [Tue Dec 13 14:27:03 2022 corelib/line_reader.c:HbnBufferedLineReaderNew:21] fail to open file 'index' with mode 'r': No such file or directory [13/12/2022 14:27:03] - CRITICAL - Error: hs-blastn index failed [13/12/2022 14:30:57] - INFO - Gap dictionary successfully loaded [13/12/2022 14:33:42] - INFO - Genome size: 3209286105 bases [13/12/2022 14:33:42] - INFO - Mapped bases: 14361149197 bases [13/12/2022 14:33:42] - INFO - Depth of coverage: 4.47x [13/12/2022 14:33:42] - INFO - Read overlap upper limit: 18.0

[13/12/2022 14:33:42] - INFO - Total number of mapped reads: 1803553

[13/12/2022 14:33:42] - INFO - Clustering SV breakends [13/12/2022 14:34:03] - INFO - Filtering INS and INV SVs [13/12/2022 14:34:28] - DEBUG - [Tue Dec 13 14:34:28 2022 app/hbnmap/cmdline_args.cpp:ParseHbnProgramCmdLineArguments:831] Too many query and subject values: 'align -db ~/data/hg38.fa -window_masker_db ./hg38.counts.obinary -query ./temp2.fa -out ./temp-hg38-blast.tab -outfmt 6 -num_threads 23 -max_target_seqs 3 -gapopen 0 -gapextend 4 -penalty -3 -reward 2' [13/12/2022 14:34:28] - CRITICAL - Error: hs-blastn alignment failed

cytham commented 1 year ago

@egnst can you confirm you have write access in the folder of your reference genome?

cytham commented 1 year ago

Closed due to no response