bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0
56 stars 4 forks source link

The distance result is very different from the old version #38

Closed danrlu closed 1 year ago

danrlu commented 1 year ago

Thanks so much for the Rust implementation and various improvements! Very exciting!

I tried out ska2 v0.3.0 distance and the result is a bit unexpected so trying to understand the difference. Fresh install:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo install ska

The version shows ska 0.3.0

Prepare ska_input.tsv

A   A_R1.fq.gz  A_R2.fq.gz
B   B_R1.fq.gz  B_R2.fq.gz

Then ran (--threads 4 b/c there were 45 samples)

ska build -o seqs --threads 4 -f ska_input.tsv
ska distance seqs.skf > distances.txt

In the distances.txt file

Sample1 Sample2 Distance    Mismatches
A   B   12465.53    0.02555

If I run the same sample with the old ska installed with conda, whose version is Version: 1.0

ska fastq -o A A_R1.fq.gz A_R2.fq.gz
ska fastq -o B B_R1.fq.gz B_R2.fq.gz
ska distance -s 25 -i 0.95 A.skf B.skf > distance.tsv

In the distances.distances.tsv

Sample 1    Sample 2    Matches Mismatches  Jaccard Index   Mash-like distance  SNPs    SNP distance
A   B   2700348 30201   0.98894 0.000179886 7   2.59226e-06

So these 2 samples have 7 SNPs difference which is plausible given them came from a suspected outbreak, and none of the numbers match SKA2. Among the 45 samples, SKA picked up a nice cluster but SKA2 gave at least a few thousands in the distance column and the numbers are not integers. Reporting in case helpful. Thanks again!

johnlees commented 1 year ago

This could be a few things:

It could also be a bug in the implementation. If you are able to send me the input files A_R1.fq.gz A_R2.fq.gz B_R1.fq.gz B_R2.fq.gz I can take a look

johnlees commented 1 year ago

Testing on fasta input gives me confidence that the distance is working properly, and this is likely due to a filtering criteria with fastq read input:

SKA1

ska fasta 19183_8#29.contigs_velvet.fa -o 19183_8#29
ska fasta 2673_8\#29.contigs_velvet.fa -o 12673_8#29
ska distance *.skf
cat distances.distances.tsv

gives

Sample 1        Sample 2        Matches Mismatches      Jaccard Index   Mash-like distance      SNPs    SNP distance
12673_8#29      19183_4#73      2874011 129683  0.956825        0.000719697     44      1.53096e-05

SKA2

ska build -k 31 12673_8\#29.contigs_velvet.fa 19183_4#73.contigs_velvet.fa -o dist_test -v
ska distance dist_test.skf -v

gives

Sample1 Sample2 Distance        Mismatches
12673_8#29.contigs_velvet       19183_4#73.contigs_velvet       45.50   0.04317

or adding --filter-ambiguous which reports removing 34 split k-mers:

Sample1 Sample2 Distance        Mismatches
12673_8#29.contigs_velvet       19183_4#73.contigs_velvet       41.00   0.04317

The mismatches/matches proportion are the same. The result is very similar, but not identical to the previous version.

Alignment with ska2:

>12673_8#29.contigs_velvet
AAGCTATCCCCTATGTCGAGCACGGGCGTGAGGCACAAAAA
>19183_4#73.contigs_velvet
GGATCTCTTTTCCCACTTGATGTAATTAGAGAATGTGGTGG

With ska1:

>12673_8#29
AGCTCATAGGGTAAGAGTGTGATTCAAGCTAACGGGTCCAAGCC
>19183_4#73
TAACTGCGAAAGGGATACAAATGATGGATCGGTATACTTGGATT

ska1 humanise and filtering

CGGCGAGGTCAGCTGCGATTGATAATAATC  G       A
ATATATTTGCCATTTAAAGTAGTTATTTCG  C       T
AAATATGTGGCTGCGGTGATGGGAAACAGT  G       A
ATAAAAATTAAAAATGTAACAAAATTTCCA  G       A
ACTTAGAAAAAATAACACTAATAATTAGTA  G       A
TCGCTTACTTGTCCACTTGGATTGCGCCAA  A       G
AATCACCTCGTAAAATTTTACGAGGTGATT  T       A
ACCGAGCGACTGTAAAGAAGCCAATCTAAA  C       T
GGTAGAAATACGATGCAGTCATTTCTGATA  G       A
AGAGTCTTTAGAGCCTAACGACGAACATTA  G       A
GTTGGCTTTGTATTTTAATGGCGGTGTTGA  G       A
GCCGGAATCAATTTTTACAAGCACAAGAAA  T       A
CATTAACTCGCGAAATCGCTCATTGCACCA  T       C
ACCACAACCGCGACAATGAAATAATCATTA  G       A
CGATGCGAAAGAATCCTTACTTCTTCAAAA  A       T
GAAAAGCGAAGACTTTTTAATGTAGAAAAA  C       A
GCTATTTTTGCAATTTCATGGCGACAAGAC  T       G
ACTAATGCTTGGTTCTGCTTATACTTGTCC  G       A
AAAATTCCTGGTGTCCTGCTATGTCTCCAC  C       T
ACAATACAAACGTATTTAGAGCTTGATGAT  G       A
AAGTGCCGCTTTCGGCCAATTACAGTTCCG  C       T
CGGATGGGACAACGGAATCACAATAATTCC  A       G
AAACTAGAATATACATGGGTAACTATCAAG  A       G
GGGAGTTAGTGGAGCATTGCAGCTGCGCCA  T       C
TATCAGGTAGCGCAGACCACCAGATTTACA  T       C
CAGATGATGAAGGTACTTGGTCGTCGAAGG  C       T
ATCCCAGCCAAAGCAAGAATGATCAATCCC  C       T
AGACAAAGACGGTAACCAATCACAGGCCTA  A       G
GACCGCTTTGAAACGATGCAGCTGGAGGAA  A       G
ATCCACAACAATAACCGGCTTTCCTACTAA  C       T
AAACGTAGACTTGTCCGCGAATTCGTTAAC  T       C
CTGGAGCCTCCGCTTTACTAGTGGAGTTTC  A       G
AAAACCAGCGAATGTGATACAGAAGAAATC  A       G
AAAAGAAATATATTCGGTTTCACAGGACCA  C       T
ATCATAAACCCCAAGCCAATTTCCTTATCA  A       G
ATTTGCATTTTATTGCAATAAAATGCAAAT  A       T
TATAATCCGGAAGCCCAATACGCTTTGACA  A       G
AAAGCCAGCAATTTTAAAATTGCTGGCTTT  A       T
TAATAAACCAGAGTAAAAGTAAACAAGTAA  T       C
GTTAAACATACCACGATTTGTGGCATCGTA  T       G
ATCTGTTGCAGCTAAATAGTTCCTGTTGGA  A       G
AGCAGCAAGATTTACCTGGATGCTTTGCCC  G       A
GAAGCGCTGCAAGATCCGAGCAAAAAGCGA  G       A
ATTCGAGCTGATCTATGATTGCCTAATTTG  G       T

ska2 nk --full-info:

ATCATAAACCCCAAG CCAATTTCCTTATCA A G
AAAACCAGCGAATGT GATACAGAAGAAATC A G
ATAAAAATTAAAAAT GTAACAAAATTTCCA G A
ATATATTTGCCATTT AAAGTAGTTATTTCG C T
AAACGTAGACTTGTC CGCGAATTCGTTAAC T C
TTTCTTGTGCTTGTA AAAATTGATTCCGGC A T
TAATAAACCAGAGTA AAAGTAAACAAGTAA T C
AAAAGAAATATATTC GGTTTCACAGGACCA C T
ATCCCAGCCAAAGCA AGAATGATCAATCCC C T
TATCAGAAATGACTG CATCGTATTTCTACC C T
ATCCACAACAATAAC CGGCTTTCCTACTAA C T
CATTAACTCGCGAAA TCGCTCATTGCACCA T C
TACGATGCCACAAAT CGTGGTATGTTTAAC A C
TTCCTCCAGCTGCAT CGTTTCAAAGCGGTC T C
ACTAATGCTTGGTTC TGCTTATACTTGTCC G A
TATCAGGTAGCGCAG ACCACCAGATTTACA T C
AAGTGCCGCTTTCGG CCAATTACAGTTCCG C T
ATTCGAGCTGATCTA TGATTGCCTAATTTG G T
AGACAAAGACGGTAA CCAATCACAGGCCTA A G
ACCACAACCGCGACA ATGAAATAATCATTA G A
CAGATGATGAAGGTA CTTGGTCGTCGAAGG C T
CTGGAGCCTCCGCTT TACTAGTGGAGTTTC A G
AAAATTCCTGGTGTC CTGCTATGTCTCCAC C T
ACAATACAAACGTAT TTAGAGCTTGATGAT G A
AAATATGTGGCTGCG GTGATGGGAAACAGT G A
TTTTTCTACATTAAA AAGTCTTCGCTTTTC G T
ACCGAGCGACTGTAA AGAAGCCAATCTAAA C T
ACTTAGAAAAAATAA CACTAATAATTAGTA G A
GCTATTTTTGCAATT TCATGGCGACAAGAC T G
CGGCGAGGTCAGCTG CGATTGATAATAATC G A
TATAATCCGGAAGCC CAATACGCTTTGACA A G
AGAGTCTTTAGAGCC TAACGACGAACATTA G A
AGCAGCAAGATTTAC CTGGATGCTTTGCCC G A
TCAACACCGCCATTA AAATACAAAGCCAAC C T
CGGATGGGACAACGG AATCACAATAATTCC A G
TCGCTTTTTGCTCGG ATCTTGCAGCGCTTC C T
AAACTAGAATATACA TGGGTAACTATCAAG A G
TCGCTTACTTGTCCA CTTGGATTGCGCCAA A G
CGATGCGAAAGAATC CTTACTTCTTCAAAA A T
ATCTGTTGCAGCTAA ATAGTTCCTGTTGGA A G
TGGCGCAGCTGCAAT GCTCCACTAACTCCC A G

The three split k-mers in ska1 and not in ska2 are:

AAAGCCAGCAATTTTAAAATTGCTGGCTTT A T
AATCACCTCGTAAAATTTTACGAGGTGATT T A
ATTTGCATTTTATTGCAATAAAATGCAAAT A T

These are all palindromes so end up as 'W' as we don't know which strand they are on (i.e. A or T for the middle base), which I think is better behaviour.

danrlu commented 1 year ago

(Sorry for the delay! I was out traveling.)

My previous test data was not published, so I reran with the data below. They came from a Legionella outbreak investigation here. See Table 4 and Figure 3 for the corresponding samples (sample 125, 192091 and 4029) and tree with SNP distance using more conventional SNP calling. https://www.ebi.ac.uk/ena/browser/view/ERR556974 https://www.ebi.ac.uk/ena/browser/view/ERR556979 https://www.ebi.ac.uk/ena/browser/view/ERR557005

SKA1 commands:

ska fastq -o ERR556974 ERR556974.fastq.gz
ska fastq -o ERR556979 ERR556979.fastq.gz
ska fastq -o ERR557005 ERR557005.fastq.gz

ska summary ERR556974.skf ERR556979.skf ERR557005.skf > summary.txt

ska distance -s 25 -i 0.95 ERR556974.skf ERR556979.skf ERR557005.skf > distance.tsv

stout printed:

SKA: Split Kmer Analysis
Version: 1.0
Citation: TBA

Creating split 15mers for ERR557005.fastq.gz 
Filters:
    Coverage cutoff = 4
    Per file coverage cutoff = 2
    Minimum minor allele frequency = 0.2
    Minimum base quality = 20
Output will be written to ERR557005.skf

Reading ERR557005.fastq.gz
Added 69977382 kmers from 3499914 sequences
7930543 unique kmers in map
Filtering kmers for file coverage
3489747 unique kmers in map after filtering
Filtering kmers for total coverage
3220444 unique kmers in map after filtering
Mean kmer coverage is 20.1622
Writing split kmers to ERR557005.skf
Done

Results:

Sample 1    Sample 2    Matches Mismatches  Jaccard Index   Mash-like distance  SNPs    SNP distance
ERR556974   ERR556979   3253428 89621   0.973192    0.000441268 2   6.14736e-07
ERR556974   ERR557005   3181584 135480  0.959157    0.000679607 2   6.28618e-07
ERR556979   ERR557005   3206931 124855  0.962526    0.000621917 2   6.23649e-07

With SKA2 I tried to match the parameters and also incorporate what you mentioned above. Only not sure whether the --min-count is per sample or for all samples:

ska build -o seqs -k 15 --min-count 4 --min-qual 20 ERR556974.fastq.gz ERR556979.fastq.gz ERR557005.fastq.gz

ska distance --filter-ambiguous seqs.skf > distances.txt

Results:

Sample1 Sample2 Distance    Mismatches
ERR556974   ERR556979   728.00  0.07357
ERR556974   ERR557005   191.00  0.04552
ERR556979   ERR557005   312.00  0.07042

(No rush to figure this out. I'm just reporting trying to understand and help improve the implementation. Thanks again for the new version!!)

johnlees commented 1 year ago

One first recommendation, if you use -v with your ska2 commands you'll get a bit more info printed out during the run which can be handy.

Although I don't think that ends up being helpful here as it doesn't show what I think is the issue, which is that by running the command in this way these are being treated as fasta files so the sequencing errors are not being filtered. Instead, these files need to be deinterleaved and given using the -f argument to ska build. See: https://docs.rs/ska/latest/ska/#read-files Although I appreciate the way currently written this might not be totally clear!

It would be easy enough for me to change the code to work with interleaved fastq files – is that a common use case you have?

(also to be comparable with ska1 you'd want -k 31)

So with issue38_reads.txt:

ERR556974       ERR556974_1.fastq.gz    ERR556974_2.fastq.gz
ERR556979       ERR556979_1.fastq.gz    ERR556979_2.fastq.gz
ERR557005       ERR557005_1.fastq.gz    ERR557005_2.fastq.gz

then running:

ska build -f issue38_reads.txt -o issue38 -k 31 -v
ska distance --filter-ambiguous issue38.skf -v

I get:

SKA: Split K-mer Analysis (the alignment-free aligner)
2023-07-04T12:18:26.261Z INFO  [ska] k=31: using 64-bit representation
2023-07-04T12:18:26.261Z INFO  [ska::merge_ska_dict] Building skf dicts from sequence input
2023-07-04T12:18:26.261Z INFO  [ska::merge_ska_dict] If FASTQ input: min count: 10; minimum quality 20 (5); filter applied: Middle base quality filtering
2023-07-04T12:18:26.261Z INFO  [ska::merge_ska_dict] Build and merge serially
2023-07-04T12:19:21.894Z INFO  [ska::generic_modes] Converting to array representation and saving
SKA done in 56s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
SKA: Split K-mer Analysis (the alignment-free aligner)
2023-07-04T12:21:44.543Z INFO  [ska::merge_ska_array] Loading skf file
2023-07-04T12:21:44.893Z INFO  [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites
2023-07-04T12:21:44.915Z INFO  [ska::merge_ska_array] Filtering removed 3070034 split k-mers
2023-07-04T12:21:44.915Z INFO  [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites or ambiguous bases
2023-07-04T12:21:44.926Z INFO  [ska::merge_ska_array] Filtering removed 54 split k-mers
2023-07-04T12:21:44.926Z INFO  [ska::generic_modes] Calculating distances
Sample1 Sample2 Distance        Mismatches
ERR556974       ERR556979       2.00    0.03898
ERR556974       ERR557005       2.00    0.06467
ERR556979       ERR557005       3.00    0.06152
SKA done in 0s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
2023-07-04T12:21:44.931Z INFO  [ska] Complete

Which looks like what we'd expect here

So my proposed solution:

danrlu commented 1 year ago

Wonderful. This all makes a lot of sense and I like the proposed solutions!

I thought the -f option is a convenience and didn't realize it's required for SKA2 to recognize them as fastq files. This is very good to know.

In the original run where I noticed the discrepancy, I did use an input TSV to list R1 and R2 reads b/c there were 45 samples. So after setting -k 31 and --filter-ambiguous, SKA2 results now look similar to SKA1 results. I think the unexpected part for me is I ran SKA1 with mostly default parameters and it just gave good results (like what samples should have small SNP differences and cluster together), but in order for SKA2 to be comparable, certain parameters need to be set appropriately. So mentioning this in the doc will definitely be helpful!

I did not realize the test files were interleaved fastq files. I got them from ENA and SKA1 didn't complain so I never looked closely into it. For our in-house data I have never seen interleaved fastqs so I don't feel strongly about supporting this, but I do see single-end data often and if I understand SKA set up correctly, I feel treating interleaved fastq as single-end data works just fine?

Also good to know about -v. For program that runs very fast like SKA2, I usually just sit around and watch progress in my terminal (rather than sending it to a HPC). So I wouldn't mind having verbose mode as a default so I know what's happening behind the scene, in case you wonder.

Thanks again for the help and the tool!!

johnlees commented 1 year ago

Ok great, thanks for all the feedback and testing, it really helps get all the documentation and CLI use working as well as possible. I will make the suggested changes soon.