Non-nucleotide bases produced within ska map alignment

jdaeth274 commented 1 year ago

Version: 0.3.0 Command run:

cat ${FILE_LIST} | parallel -j 8 "NGSID=\$(basename {} | sed 's/_a.*fasta/_temp_skf/g');
 ska build -o \$NGSID -k 31 {} ;
MAPOUT=\$(basename {} | sed 's/_a.*fasta/_aln/g');
 ska map $REFERENCE_LOC \${NGSID}.skf -o \${MAPOUT}.aln"

Example output:

head -n 2 test_aln.aln | tail -n 1 | sed 's/./&\n/g' | sort | uniq -ic
      1 
 128836 -
 574319 A
     39 B
 380006 C
     63 D
 382082 G
     56 H
    154 K
    176 M
    800 R
     35 S
 569372 T
     37 V
    126 W
    766 Y

Context

Hi guys

Just been running into some issues with the mapped alignments for my large collection of pneumo isolates. The code above has run fine, but when I've put the resultant large alignment, created from cat *.aln > tot_alignment.aln , through Gubbins it highlighted odd non-nucleotide characters in the sequence.

I've double checked through my assemblies used to create these alignments, and these characters listed above aren't present in there. An example in the alignment in the seaview screenshot here:

ambiguous_bases_example

There is no N count in the example output above, maybe these are misassigned Ns? Let me know what you think is going on and if you need some more detail.

johnlees commented 1 year ago

Hi Josh, These are amibguous bases, see here: https://www.bioinformatics.org/sms/iupac.html If you don't want these in the output you can run ska weed --filter no-ambig-or-const <skf_file>.

Although this makes me realise that we don't have filter options in ska map (they are in ska align), and the weed option would remove constant sites which you probably don't want.

I'll add a solution for this shortly.

jdaeth274 commented 1 year ago

Hi John,

Thanks for explaining that!

Just for an update, I tried out ska weed with the --no-ambig-or-const filter on a .skf file for a single isolate and it seemed to remove all the k-mers from the file. Not sure if I should alter the default --min-freq if there's only one sample in the .skf file.

ska build -o test_weed -k 31 test_contigs.fasta 
ska nk test_weed.skf

SKA: Split K-mer Analysis (the alignment-free aligner)
ska_version=0.3.0
k=31
k_bits=64
rc=true
1978995 k-mers
1 samples
["test_contigs.fasta"]

ska weed --filter no-ambig-or-const -v test_weed.skf

SKA: Split K-mer Analysis (the alignment-free aligner)
2023-07-03T09:26:45.434Z INFO  [ska] Loading skf file
2023-07-03T09:26:45.434Z INFO  [ska::merge_ska_array] Loading skf file
2023-07-03T09:26:45.833Z INFO  [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites or ambiguous bases
2023-07-03T09:26:45.855Z INFO  [ska::merge_ska_array] Filtering removed 1978995 split k-mers
2023-07-03T09:26:45.855Z INFO  [ska::generic_modes] Saving modified skf file
SKA done in 0s

ska nk test_weed.skf 

SKA: Split K-mer Analysis (the alignment-free aligner)
ska_version=0.3.0
k=31
k_bits=64
rc=true
0 k-mers
1 samples
["test_contigs.fasta"]

johnlees commented 1 year ago

This would be because with one sample all the sites are constant, I think I need to add #42 to allow you a no-ambig (and keep constant sites) filter

johnlees commented 1 year ago

Will address this in #42

johnlees commented 12 months ago

@jdaeth274 more filters added in v0.3.1, which has just been released. --ambig-mask is probably of use to you here

jdaeth274 commented 12 months ago

Great, thanks John! I will try this out

bacpop / ska.rust

Non-nucleotide bases produced within ska map alignment #41

Context