Closed jdaeth274 closed 1 year ago
Hi Josh,
These are amibguous bases, see here: https://www.bioinformatics.org/sms/iupac.html
If you don't want these in the output you can run ska weed --filter no-ambig-or-const <skf_file>
.
Although this makes me realise that we don't have filter options in ska map
(they are in ska align
), and the weed option would remove constant sites which you probably don't want.
I'll add a solution for this shortly.
Hi John,
Thanks for explaining that!
Just for an update, I tried out ska weed
with the --no-ambig-or-const
filter on a .skf
file for a single isolate and it seemed to remove all the k-mers from the file.
Not sure if I should alter the default --min-freq
if there's only one sample in the .skf
file.
ska build -o test_weed -k 31 test_contigs.fasta
ska nk test_weed.skf
SKA: Split K-mer Analysis (the alignment-free aligner)
ska_version=0.3.0
k=31
k_bits=64
rc=true
1978995 k-mers
1 samples
["test_contigs.fasta"]
ska weed --filter no-ambig-or-const -v test_weed.skf
SKA: Split K-mer Analysis (the alignment-free aligner)
2023-07-03T09:26:45.434Z INFO [ska] Loading skf file
2023-07-03T09:26:45.434Z INFO [ska::merge_ska_array] Loading skf file
2023-07-03T09:26:45.833Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites or ambiguous bases
2023-07-03T09:26:45.855Z INFO [ska::merge_ska_array] Filtering removed 1978995 split k-mers
2023-07-03T09:26:45.855Z INFO [ska::generic_modes] Saving modified skf file
SKA done in 0s
ska nk test_weed.skf
SKA: Split K-mer Analysis (the alignment-free aligner)
ska_version=0.3.0
k=31
k_bits=64
rc=true
0 k-mers
1 samples
["test_contigs.fasta"]
This would be because with one sample all the sites are constant, I think I need to add #42 to allow you a no-ambig (and keep constant sites) filter
Will address this in #42
@jdaeth274 more filters added in v0.3.1, which has just been released. --ambig-mask
is probably of use to you here
Great, thanks John! I will try this out
Version:
0.3.0
Command run:Example output:
Context
Hi guys
Just been running into some issues with the mapped alignments for my large collection of pneumo isolates. The code above has run fine, but when I've put the resultant large alignment, created from
cat *.aln > tot_alignment.aln
, through Gubbins it highlighted odd non-nucleotide characters in the sequence.I've double checked through my assemblies used to create these alignments, and these characters listed above aren't present in there. An example in the alignment in the seaview screenshot here:
There is no
N
count in the example output above, maybe these are misassigned Ns? Let me know what you think is going on and if you need some more detail.