bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0
56 stars 4 forks source link

Suggestion: ska map to only report positions of variable split-kmers #65

Closed rderelle closed 4 months ago

rderelle commented 4 months ago

Not issue, a suggestion:

ska map generates a VCF or alignment file corresponding to all positions of the reference genome.

Would it be interesting to have an option to restrict the mapping only to variable split-kmers? This could generate smaller output files and would avoid to manually filter the output if only variable sites (SNPs) are of interest.

johnlees commented 4 months ago

The VCF should be variant only, I think (let me know if not right, as this is what it says in the docs).

I could do this for .aln files too, but there is also snp-sites which offers the same functionality

rderelle commented 4 months ago

From what I remember, the VCF output all positions (I would need to confirm this).

But I'm possibly now encountering a bug. The ska map function generates an alignment without any issue, but SKA2 crashes when '-f vcf' is specified using the same skf file and same reference genome:

./ska_0.3.5 map -v -f vcf Spn_ATCC_700669.fna D39V__out0.skf SKA: Split K-mer Analysis (the alignment-free aligner) 2024-02-12T10:51:25.285Z INFO [ska] Loading skf as dictionary 2024-02-12T10:51:25.285Z INFO [ska::io_utils] Single file as input, trying to load as skf 64-bits 2024-02-12T10:51:26.612Z INFO [ska] Making skf of reference k=31 rc=true 2024-02-12T10:51:26.660Z INFO [ska::generic_modes] Converting skf to dictionary 2024-02-12T10:51:26.854Z INFO [ska::generic_modes] Mapping 2024-02-12T10:51:27.276Z INFO [ska::generic_modes] Generating VCF with 1 threads 2024-02-12T10:51:27.404Z INFO [ska::ska_ref] Converting to VCF thread 'main' panicked at src/ska_ref.rs:451:32: Could not add contig to header: Invalid note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

johnlees commented 4 months ago

Looks like it could be an issue with noodles_vcf. Which version of that and noodles is your ska built with, do you know?

Also, what do you get if you run grep ">" Spn_ATCC_700669.fna

rderelle commented 4 months ago

grep ">" Spn_ATCC_700669.fna NC_011900.1 Streptococcus pneumoniae ATCC 700669, complete sequence

I'm afraid I do not know about noodles.

johnlees commented 4 months ago

Can you attach the .fna file here so I can take a look

For the versions, look in Cargo.lock for noodles-core and noodles-vcf e.g.

[[package]]
name = "noodles-vcf"
version = "0.22.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "631a8103ca24338a93d0ed4b14c03a3d11d0290c2c643f7dc28327603ca3eb40"
dependencies = [
 "indexmap",
 "memchr",
 "nom",
 "noodles-bgzf",
 "noodles-core",
 "noodles-csi",
 "noodles-tabix",
 "percent-encoding",
]
rderelle commented 4 months ago

I can't check this as I only keep the binaries after compiling. I put the skf file and reference genomes are on my google drive: https://drive.google.com/drive/folders/1OWEQtq0UdnzZMPEIi-X0O_wxqEPn1mOo?usp=drive_link

johnlees commented 4 months ago

I don't think it likes the gaps in the chromsome name in the header, if I change the title in the ref fasta to:

>NC_011900.1

It works (and outputs only variant sites)

I will add a fix which deals with these formats of titles