Closed rderelle closed 4 months ago
The VCF should be variant only, I think (let me know if not right, as this is what it says in the docs).
I could do this for .aln files too, but there is also snp-sites which offers the same functionality
From what I remember, the VCF output all positions (I would need to confirm this).
But I'm possibly now encountering a bug. The ska map function generates an alignment without any issue, but SKA2 crashes when '-f vcf' is specified using the same skf file and same reference genome:
./ska_0.3.5 map -v -f vcf Spn_ATCC_700669.fna D39V__out0.skf
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-02-12T10:51:25.285Z INFO [ska] Loading skf as dictionary
2024-02-12T10:51:25.285Z INFO [ska::io_utils] Single file as input, trying to load as skf 64-bits
2024-02-12T10:51:26.612Z INFO [ska] Making skf of reference k=31 rc=true
2024-02-12T10:51:26.660Z INFO [ska::generic_modes] Converting skf to dictionary
2024-02-12T10:51:26.854Z INFO [ska::generic_modes] Mapping
2024-02-12T10:51:27.276Z INFO [ska::generic_modes] Generating VCF with 1 threads
2024-02-12T10:51:27.404Z INFO [ska::ska_ref] Converting to VCF
thread 'main' panicked at src/ska_ref.rs:451:32:
Could not add contig to header: Invalid
note: run with RUST_BACKTRACE=1
environment variable to display a backtrace
Looks like it could be an issue with noodles_vcf
. Which version of that and noodles is your ska built with, do you know?
Also, what do you get if you run grep ">" Spn_ATCC_700669.fna
grep ">" Spn_ATCC_700669.fna NC_011900.1 Streptococcus pneumoniae ATCC 700669, complete sequence
I'm afraid I do not know about noodles.
Can you attach the .fna file here so I can take a look
For the versions, look in Cargo.lock
for noodles-core
and noodles-vcf
e.g.
[[package]]
name = "noodles-vcf"
version = "0.22.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "631a8103ca24338a93d0ed4b14c03a3d11d0290c2c643f7dc28327603ca3eb40"
dependencies = [
"indexmap",
"memchr",
"nom",
"noodles-bgzf",
"noodles-core",
"noodles-csi",
"noodles-tabix",
"percent-encoding",
]
I can't check this as I only keep the binaries after compiling. I put the skf file and reference genomes are on my google drive: https://drive.google.com/drive/folders/1OWEQtq0UdnzZMPEIi-X0O_wxqEPn1mOo?usp=drive_link
I don't think it likes the gaps in the chromsome name in the header, if I change the title in the ref fasta to:
>NC_011900.1
It works (and outputs only variant sites)
I will add a fix which deals with these formats of titles
Not issue, a suggestion:
ska map generates a VCF or alignment file corresponding to all positions of the reference genome.
Would it be interesting to have an option to restrict the mapping only to variable split-kmers? This could generate smaller output files and would avoid to manually filter the output if only variable sites (SNPs) are of interest.