Closed danrlu closed 1 year ago
This could be a few things:
--filter-ambiguous
to ska distance
.It could also be a bug in the implementation. If you are able to send me the input files A_R1.fq.gz A_R2.fq.gz B_R1.fq.gz B_R2.fq.gz I can take a look
Testing on fasta input gives me confidence that the distance is working properly, and this is likely due to a filtering criteria with fastq read input:
SKA1
ska fasta 19183_8#29.contigs_velvet.fa -o 19183_8#29
ska fasta 2673_8\#29.contigs_velvet.fa -o 12673_8#29
ska distance *.skf
cat distances.distances.tsv
gives
Sample 1 Sample 2 Matches Mismatches Jaccard Index Mash-like distance SNPs SNP distance
12673_8#29 19183_4#73 2874011 129683 0.956825 0.000719697 44 1.53096e-05
SKA2
ska build -k 31 12673_8\#29.contigs_velvet.fa 19183_4#73.contigs_velvet.fa -o dist_test -v
ska distance dist_test.skf -v
gives
Sample1 Sample2 Distance Mismatches
12673_8#29.contigs_velvet 19183_4#73.contigs_velvet 45.50 0.04317
or adding --filter-ambiguous
which reports removing 34 split k-mers:
Sample1 Sample2 Distance Mismatches
12673_8#29.contigs_velvet 19183_4#73.contigs_velvet 41.00 0.04317
The mismatches/matches proportion are the same. The result is very similar, but not identical to the previous version.
Alignment with ska2:
>12673_8#29.contigs_velvet
AAGCTATCCCCTATGTCGAGCACGGGCGTGAGGCACAAAAA
>19183_4#73.contigs_velvet
GGATCTCTTTTCCCACTTGATGTAATTAGAGAATGTGGTGG
With ska1:
>12673_8#29
AGCTCATAGGGTAAGAGTGTGATTCAAGCTAACGGGTCCAAGCC
>19183_4#73
TAACTGCGAAAGGGATACAAATGATGGATCGGTATACTTGGATT
ska1 humanise and filtering
CGGCGAGGTCAGCTGCGATTGATAATAATC G A
ATATATTTGCCATTTAAAGTAGTTATTTCG C T
AAATATGTGGCTGCGGTGATGGGAAACAGT G A
ATAAAAATTAAAAATGTAACAAAATTTCCA G A
ACTTAGAAAAAATAACACTAATAATTAGTA G A
TCGCTTACTTGTCCACTTGGATTGCGCCAA A G
AATCACCTCGTAAAATTTTACGAGGTGATT T A
ACCGAGCGACTGTAAAGAAGCCAATCTAAA C T
GGTAGAAATACGATGCAGTCATTTCTGATA G A
AGAGTCTTTAGAGCCTAACGACGAACATTA G A
GTTGGCTTTGTATTTTAATGGCGGTGTTGA G A
GCCGGAATCAATTTTTACAAGCACAAGAAA T A
CATTAACTCGCGAAATCGCTCATTGCACCA T C
ACCACAACCGCGACAATGAAATAATCATTA G A
CGATGCGAAAGAATCCTTACTTCTTCAAAA A T
GAAAAGCGAAGACTTTTTAATGTAGAAAAA C A
GCTATTTTTGCAATTTCATGGCGACAAGAC T G
ACTAATGCTTGGTTCTGCTTATACTTGTCC G A
AAAATTCCTGGTGTCCTGCTATGTCTCCAC C T
ACAATACAAACGTATTTAGAGCTTGATGAT G A
AAGTGCCGCTTTCGGCCAATTACAGTTCCG C T
CGGATGGGACAACGGAATCACAATAATTCC A G
AAACTAGAATATACATGGGTAACTATCAAG A G
GGGAGTTAGTGGAGCATTGCAGCTGCGCCA T C
TATCAGGTAGCGCAGACCACCAGATTTACA T C
CAGATGATGAAGGTACTTGGTCGTCGAAGG C T
ATCCCAGCCAAAGCAAGAATGATCAATCCC C T
AGACAAAGACGGTAACCAATCACAGGCCTA A G
GACCGCTTTGAAACGATGCAGCTGGAGGAA A G
ATCCACAACAATAACCGGCTTTCCTACTAA C T
AAACGTAGACTTGTCCGCGAATTCGTTAAC T C
CTGGAGCCTCCGCTTTACTAGTGGAGTTTC A G
AAAACCAGCGAATGTGATACAGAAGAAATC A G
AAAAGAAATATATTCGGTTTCACAGGACCA C T
ATCATAAACCCCAAGCCAATTTCCTTATCA A G
ATTTGCATTTTATTGCAATAAAATGCAAAT A T
TATAATCCGGAAGCCCAATACGCTTTGACA A G
AAAGCCAGCAATTTTAAAATTGCTGGCTTT A T
TAATAAACCAGAGTAAAAGTAAACAAGTAA T C
GTTAAACATACCACGATTTGTGGCATCGTA T G
ATCTGTTGCAGCTAAATAGTTCCTGTTGGA A G
AGCAGCAAGATTTACCTGGATGCTTTGCCC G A
GAAGCGCTGCAAGATCCGAGCAAAAAGCGA G A
ATTCGAGCTGATCTATGATTGCCTAATTTG G T
ska2 nk --full-info
:
ATCATAAACCCCAAG CCAATTTCCTTATCA A G
AAAACCAGCGAATGT GATACAGAAGAAATC A G
ATAAAAATTAAAAAT GTAACAAAATTTCCA G A
ATATATTTGCCATTT AAAGTAGTTATTTCG C T
AAACGTAGACTTGTC CGCGAATTCGTTAAC T C
TTTCTTGTGCTTGTA AAAATTGATTCCGGC A T
TAATAAACCAGAGTA AAAGTAAACAAGTAA T C
AAAAGAAATATATTC GGTTTCACAGGACCA C T
ATCCCAGCCAAAGCA AGAATGATCAATCCC C T
TATCAGAAATGACTG CATCGTATTTCTACC C T
ATCCACAACAATAAC CGGCTTTCCTACTAA C T
CATTAACTCGCGAAA TCGCTCATTGCACCA T C
TACGATGCCACAAAT CGTGGTATGTTTAAC A C
TTCCTCCAGCTGCAT CGTTTCAAAGCGGTC T C
ACTAATGCTTGGTTC TGCTTATACTTGTCC G A
TATCAGGTAGCGCAG ACCACCAGATTTACA T C
AAGTGCCGCTTTCGG CCAATTACAGTTCCG C T
ATTCGAGCTGATCTA TGATTGCCTAATTTG G T
AGACAAAGACGGTAA CCAATCACAGGCCTA A G
ACCACAACCGCGACA ATGAAATAATCATTA G A
CAGATGATGAAGGTA CTTGGTCGTCGAAGG C T
CTGGAGCCTCCGCTT TACTAGTGGAGTTTC A G
AAAATTCCTGGTGTC CTGCTATGTCTCCAC C T
ACAATACAAACGTAT TTAGAGCTTGATGAT G A
AAATATGTGGCTGCG GTGATGGGAAACAGT G A
TTTTTCTACATTAAA AAGTCTTCGCTTTTC G T
ACCGAGCGACTGTAA AGAAGCCAATCTAAA C T
ACTTAGAAAAAATAA CACTAATAATTAGTA G A
GCTATTTTTGCAATT TCATGGCGACAAGAC T G
CGGCGAGGTCAGCTG CGATTGATAATAATC G A
TATAATCCGGAAGCC CAATACGCTTTGACA A G
AGAGTCTTTAGAGCC TAACGACGAACATTA G A
AGCAGCAAGATTTAC CTGGATGCTTTGCCC G A
TCAACACCGCCATTA AAATACAAAGCCAAC C T
CGGATGGGACAACGG AATCACAATAATTCC A G
TCGCTTTTTGCTCGG ATCTTGCAGCGCTTC C T
AAACTAGAATATACA TGGGTAACTATCAAG A G
TCGCTTACTTGTCCA CTTGGATTGCGCCAA A G
CGATGCGAAAGAATC CTTACTTCTTCAAAA A T
ATCTGTTGCAGCTAA ATAGTTCCTGTTGGA A G
TGGCGCAGCTGCAAT GCTCCACTAACTCCC A G
The three split k-mers in ska1 and not in ska2 are:
AAAGCCAGCAATTTTAAAATTGCTGGCTTT A T
AATCACCTCGTAAAATTTTACGAGGTGATT T A
ATTTGCATTTTATTGCAATAAAATGCAAAT A T
These are all palindromes so end up as 'W' as we don't know which strand they are on (i.e. A or T for the middle base), which I think is better behaviour.
(Sorry for the delay! I was out traveling.)
My previous test data was not published, so I reran with the data below. They came from a Legionella outbreak investigation here. See Table 4 and Figure 3 for the corresponding samples (sample 125, 192091 and 4029) and tree with SNP distance using more conventional SNP calling. https://www.ebi.ac.uk/ena/browser/view/ERR556974 https://www.ebi.ac.uk/ena/browser/view/ERR556979 https://www.ebi.ac.uk/ena/browser/view/ERR557005
SKA1 commands:
ska fastq -o ERR556974 ERR556974.fastq.gz
ska fastq -o ERR556979 ERR556979.fastq.gz
ska fastq -o ERR557005 ERR557005.fastq.gz
ska summary ERR556974.skf ERR556979.skf ERR557005.skf > summary.txt
ska distance -s 25 -i 0.95 ERR556974.skf ERR556979.skf ERR557005.skf > distance.tsv
stout printed:
SKA: Split Kmer Analysis
Version: 1.0
Citation: TBA
Creating split 15mers for ERR557005.fastq.gz
Filters:
Coverage cutoff = 4
Per file coverage cutoff = 2
Minimum minor allele frequency = 0.2
Minimum base quality = 20
Output will be written to ERR557005.skf
Reading ERR557005.fastq.gz
Added 69977382 kmers from 3499914 sequences
7930543 unique kmers in map
Filtering kmers for file coverage
3489747 unique kmers in map after filtering
Filtering kmers for total coverage
3220444 unique kmers in map after filtering
Mean kmer coverage is 20.1622
Writing split kmers to ERR557005.skf
Done
Results:
Sample 1 Sample 2 Matches Mismatches Jaccard Index Mash-like distance SNPs SNP distance
ERR556974 ERR556979 3253428 89621 0.973192 0.000441268 2 6.14736e-07
ERR556974 ERR557005 3181584 135480 0.959157 0.000679607 2 6.28618e-07
ERR556979 ERR557005 3206931 124855 0.962526 0.000621917 2 6.23649e-07
With SKA2 I tried to match the parameters and also incorporate what you mentioned above. Only not sure whether the --min-count
is per sample or for all samples:
ska build -o seqs -k 15 --min-count 4 --min-qual 20 ERR556974.fastq.gz ERR556979.fastq.gz ERR557005.fastq.gz
ska distance --filter-ambiguous seqs.skf > distances.txt
Results:
Sample1 Sample2 Distance Mismatches
ERR556974 ERR556979 728.00 0.07357
ERR556974 ERR557005 191.00 0.04552
ERR556979 ERR557005 312.00 0.07042
(No rush to figure this out. I'm just reporting trying to understand and help improve the implementation. Thanks again for the new version!!)
One first recommendation, if you use -v
with your ska2 commands you'll get a bit more info printed out during the run which can be handy.
Although I don't think that ends up being helpful here as it doesn't show what I think is the issue, which is that by running the command in this way these are being treated as fasta files so the sequencing errors are not being filtered. Instead, these files need to be deinterleaved and given using the -f
argument to ska build
.
See: https://docs.rs/ska/latest/ska/#read-files
Although I appreciate the way currently written this might not be totally clear!
It would be easy enough for me to change the code to work with interleaved fastq files – is that a common use case you have?
(also to be comparable with ska1 you'd want -k 31
)
So with issue38_reads.txt
:
ERR556974 ERR556974_1.fastq.gz ERR556974_2.fastq.gz
ERR556979 ERR556979_1.fastq.gz ERR556979_2.fastq.gz
ERR557005 ERR557005_1.fastq.gz ERR557005_2.fastq.gz
then running:
ska build -f issue38_reads.txt -o issue38 -k 31 -v
ska distance --filter-ambiguous issue38.skf -v
I get:
SKA: Split K-mer Analysis (the alignment-free aligner)
2023-07-04T12:18:26.261Z INFO [ska] k=31: using 64-bit representation
2023-07-04T12:18:26.261Z INFO [ska::merge_ska_dict] Building skf dicts from sequence input
2023-07-04T12:18:26.261Z INFO [ska::merge_ska_dict] If FASTQ input: min count: 10; minimum quality 20 (5); filter applied: Middle base quality filtering
2023-07-04T12:18:26.261Z INFO [ska::merge_ska_dict] Build and merge serially
2023-07-04T12:19:21.894Z INFO [ska::generic_modes] Converting to array representation and saving
SKA done in 56s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
SKA: Split K-mer Analysis (the alignment-free aligner)
2023-07-04T12:21:44.543Z INFO [ska::merge_ska_array] Loading skf file
2023-07-04T12:21:44.893Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites
2023-07-04T12:21:44.915Z INFO [ska::merge_ska_array] Filtering removed 3070034 split k-mers
2023-07-04T12:21:44.915Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites or ambiguous bases
2023-07-04T12:21:44.926Z INFO [ska::merge_ska_array] Filtering removed 54 split k-mers
2023-07-04T12:21:44.926Z INFO [ska::generic_modes] Calculating distances
Sample1 Sample2 Distance Mismatches
ERR556974 ERR556979 2.00 0.03898
ERR556974 ERR557005 2.00 0.06467
ERR556979 ERR557005 3.00 0.06152
SKA done in 0s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
2023-07-04T12:21:44.931Z INFO [ska] Complete
Which looks like what we'd expect here
So my proposed solution:
Wonderful. This all makes a lot of sense and I like the proposed solutions!
I thought the -f
option is a convenience and didn't realize it's required for SKA2 to recognize them as fastq files. This is very good to know.
In the original run where I noticed the discrepancy, I did use an input TSV to list R1 and R2 reads b/c there were 45 samples. So after setting -k 31
and --filter-ambiguous
, SKA2 results now look similar to SKA1 results. I think the unexpected part for me is I ran SKA1 with mostly default parameters and it just gave good results (like what samples should have small SNP differences and cluster together), but in order for SKA2 to be comparable, certain parameters need to be set appropriately. So mentioning this in the doc will definitely be helpful!
I did not realize the test files were interleaved fastq files. I got them from ENA and SKA1 didn't complain so I never looked closely into it. For our in-house data I have never seen interleaved fastqs so I don't feel strongly about supporting this, but I do see single-end data often and if I understand SKA set up correctly, I feel treating interleaved fastq as single-end data works just fine?
Also good to know about -v
. For program that runs very fast like SKA2, I usually just sit around and watch progress in my terminal (rather than sending it to a HPC). So I wouldn't mind having verbose mode as a default so I know what's happening behind the scene, in case you wonder.
Thanks again for the help and the tool!!
Ok great, thanks for all the feedback and testing, it really helps get all the documentation and CLI use working as well as possible. I will make the suggested changes soon.
Thanks so much for the Rust implementation and various improvements! Very exciting!
I tried out ska2 v0.3.0 distance and the result is a bit unexpected so trying to understand the difference. Fresh install:
The version shows
ska 0.3.0
Prepare
ska_input.tsv
Then ran (
--threads 4
b/c there were 45 samples)In the
distances.txt
fileIf I run the same sample with the old ska installed with conda, whose version is
Version: 1.0
In the
distances.distances.tsv
So these 2 samples have 7 SNPs difference which is plausible given them came from a suspected outbreak, and none of the numbers match SKA2. Among the 45 samples, SKA picked up a nice cluster but SKA2 gave at least a few thousands in the distance column and the numbers are not integers. Reporting in case helpful. Thanks again!