WGLab / NanoCaller

Variant calling tool for long-read sequencing data
MIT License

header error in variant_calls.snps.phased.vcf.gz #36

Open AzizHN opened 1 year ago

AzizHN commented 1 year ago

Hello, I ran this command to detect variants in my ONT reads (mapped with minimap2):

NanoCaller --mode all --sequencing ont --haploid_genome --bam sorted_mapped_reads.bam --ref genes.fna

I got this as a result:

2023-06-23 12:27:16.562651: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /home/aziz/mapping/SRR23337893/args

2023-06-23 12:27:16.947255: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
SNP Calling Progress: 100%|███████████████████████| 2/2 [00:00<00:00, 6.89it/s]

2023-06-23 12:27:18.763662: Combining SNP calls.

2023-06-23 12:27:18.764897: Compressing and indexing SNP calls.
Writing to /tmp/bcftools.dkVQT8
Merging 1 temporary files
Cleaning
Done

2023-06-23 12:27:18.824115: SNP calling completed. Time taken= 0.4034

Indel Calling Progress: 100%|█████████████████████| 2/2 [00:00<00:00, 3.99it/s]

2023-06-23 12:27:19.487620: Compressing and indexing indel calls.
Checking the headers and starting positions of 2 files
[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Failed to parse header: /home/aziz/mapping/SRR23337893/variant_calls.snps.phased.vcf.gz

2023-06-23 12:27:20.501190: Indel calling completed. Time taken= 1.6770

2023-06-23 12:27:20.501373: Total Time Elapsed: 3.94 seconds

It seems that everything went well, except for a problem with the header of the file variant_calls.snps.phased.vcf.gz:

2023-06-23 12:27:19.487620: Compressing and indexing indel calls.
Checking the headers and starting positions of 2 files
[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Failed to parse header: /home/aziz/mapping/SRR23337893/variant_calls.snps.phased.vcf.gz

Can this error affect my results? Does anyone have an idea about it? Thanks in advance.

umahsn commented 1 year ago

Hi,

Can you check if there are any intermediate files in /home/aziz/mapping/SRR23337893/ under the intermediate_snp_files or intermediate_phase_files subfolders, or if a variant_calls.snps.vcf.gz file was created? It seems very suspicious that SNP calling took only 0.4s, so I am wondering whether that step ran correctly.

AzizHN commented 1 year ago

Hello @umahsn, thank you for your reply. Yes, I have several intermediate subfolders:

intermediate_indel_files, containing 2 files (variant_calls.6.indel.vcf and variant_calls.raw.indel.vcf)

intermediate_phase_files, containing 4 files (2 x refsequenceID.snps.phased.vcf.gz and 2 x refsequenceID.snps.phased.vcf.gz.tbi; I have 2 reference sequences in my FASTA reference file)

intermediate_snp_files, containing 2 files (combined.snps.vcf and variant_calls.3.snps.vcf)

And yes, a variant_calls.snps.vcf.gz file was created (514 bytes): a 7-line header and a 9-line variant table.

My input files are a BAM file (555.1 KB) and a FASTA reference file (3.6 KB).

umahsn commented 1 year ago

Hi, I think there might be a problem with passing the filenames internally within NanoCaller for haploid genomes. Let me check this and get back to you.

umahsn commented 1 year ago

Can you tell me if /home/aziz/mapping/SRR23337893/variant_calls.snps.phased.vcf.gz or refsequenceID.snps.phased.vcf.gz files are empty and if they have a header?
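One way to answer both questions at once is a small Python check like the sketch below. It is only an illustration, not part of NanoCaller: the function name `inspect_vcf_gz` is made up for this example, and it assumes the file is either zero bytes or valid gzip/bgzip (anything else will raise an error from the `gzip` module).

```python
import gzip
from pathlib import Path


def inspect_vcf_gz(path):
    """Return (size_in_bytes, header_lines, record_lines) for a gzipped VCF.

    Hypothetical helper for this thread; not a NanoCaller or bcftools API.
    """
    path = Path(path)
    size = path.stat().st_size
    header_lines = 0
    record_lines = 0
    # A zero-byte file has nothing to decompress, so skip reading it.
    if size > 0:
        with gzip.open(path, "rt") as handle:
            for line in handle:
                if line.startswith("#"):
                    header_lines += 1   # meta lines (##...) and the #CHROM line
                elif line.strip():
                    record_lines += 1   # non-empty variant records
    return size, header_lines, record_lines
```

For example, `inspect_vcf_gz("/home/aziz/mapping/SRR23337893/variant_calls.snps.phased.vcf.gz")` returning `(0, 0, 0)` would mean the file is completely empty, while a non-zero size with zero header lines would mean it has content but no VCF header.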

AzizHN commented 1 year ago

Hello @umahsn, thank you for your response. The phased files are always empty!

umahsn commented 12 months ago

Hi,

I checked the issue, and it turns out that the presence of a colon (":") in the names of the reference sequences is causing the problem. NanoCaller uses Linux system commands to run WhatsHap for phasing and bcftools for VCF file manipulation. As a result, if a VCF file is named after a reference sequence that has a colon in its name, Linux is not able to resolve the path to the file correctly. Once I replace the colon with some other symbol in the reference and BAM files, it runs correctly.
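As a rough sketch of that workaround on the reference side, the Python below rewrites FASTA header lines so that the sequence name (the first whitespace-delimited token after ">") contains no colon. The helper names and the choice of "_" as the replacement character are assumptions made for this example, not something NanoCaller provides.

```python
def sanitize_name(name, replacement="_"):
    """Replace every colon in a reference sequence name."""
    return name.replace(":", replacement)


def sanitize_fasta(lines, replacement="_"):
    """Yield FASTA lines with colons removed from sequence names only.

    The free-text description after the name is left untouched.
    """
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].rstrip("\n").split(None, 1)
            name = sanitize_name(parts[0], replacement) if parts else ""
            description = " " + parts[1] if len(parts) > 1 else ""
            yield ">" + name + description + "\n"
        else:
            yield line


def sanitize_fasta_file(src, dst, replacement="_"):
    """Write a copy of the FASTA file `src` to `dst` with sanitized names."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(sanitize_fasta(fin, replacement))
```

The BAM file has to use the same names as the rewritten reference, so the reads would need to be re-mapped against the new FASTA, or the BAM header rewritten to match (for instance with samtools reheader). It is also worth checking that the renaming does not make two sequence names identical.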