Testing mSweep on C.jejuni data, questions for constructing a reference set

Hi, thanks for providing the tool and extensive documentation. I am currently trying to apply the mGEMS pipeline, starting with mSweep, on a dataset from campylobacter samples. Incidentally these are also plate sweeps for which I also have isolate sequences.

Initial tests on a sample with a known mixture using the database you provided in the paper gave some weird results, I suspected because the reference genomes are too distant. So I decided to build my own reference genome set and grouped available isolate genomes from pubmlst by ST and selected up to 5 (when more were available) genomes with different cgc_100_group (so more than 100 different alleles in cgMLST) for each ST, in the end 1715 genomes with distinct 1388 STs. [[ while I wrote this up I noticed I made a mistake in the selection, and actually this should lead to 3026 genomes with 2332 unique STs]]. I then ran the themisto pseudoalignment and mSweep (v1.4) again on the sample with the known mixture and I still get incorrect abundances / group assignments (example below).

I had hoped that maybe you could point me in the right direction on how to use the tool and to improve my understanding on how to build a more suitable reference dataset. Mainly I am trying to understand how finely grained the clustering should be.

Thanks in advance for any help and sorry if this post has gotten a bit too long :)

questions

Q1: I guess the main problem is finding the right level of dis/similarity between the reference genomes and the assigned groups. In the end the grouping should serve to capture distinct strains and guide read-binning in mGems, so for example there should not be 2 samples with the same CC/ST in a single sample? Should reference genomes within the same group be as different as possible (within the group to capture as many possible strains belonging to that group) or should they be more similar to each other?

Q2: In your testing, what is the typical behavior of mSweep when no highly similar reference genome to a strain in the mixed sample is present in the reference set? From what I see, it seems that then abundances are "split" onto the closest genomes in the reference set.

Q3: How large can the reference fasta be in order to still be able to reasonably work with themisto (on a server with 250G memory)?

some examples from the results:

In the sample I have 3 strains, ~60% ST 21, ~26% ST 3218, ~14% ST 5845

mixed sample STs

ST       aspA    glnA    gltA    glyA    pgm     tkt     uncA
ST21    2   1   5   3   2   5   19
ST 3218 64      70      22      98      123     86      16
ST5845  18      70      164     97      115     86      47

abundances with CC database (paper)

                           CC                  V2
Campylobacter_jejuni_CC21      0.911332000000
Campylobacter_jejuni_CC45      0.045833100000
Campylobacter_jejuni_CC661      0.026714400000
Campylobacter_jejuni_CC353      0.013283300000
Campylobacter_jejuni_CC607      0.002822730000

CC21 is correctly matched to the most abundant strain the two other strains present in the sample do not have a CC assigned.

Top5 groups with custom database:

ST         V2         V3         V4         V5
7440 0.43524000 0.43386900 0.43344000 0.43632600
7144 0.21560000 0.21720200 0.21750400 0.21470600
4875 0.19998300 0.19962500 0.19964200 0.19971100
6428 0.05697990 0.05731250 0.05721290 0.05699080
9389 0.03805540 0.03783970 0.03823250 0.03815700

ST 7440 (except: pgm_725), 7144 (uncA_418) have a similar MLST profile to ST21 (most abundant strain), but ST 21 (5 reference genomes) is predicted at very low abundance. ST 4875 (glyA_103), ST6428 (aspA_344) have a similar profile to the 2nd strain, the correct ST3218 (1 reference genome) is predicted at abundance 0.008. ST 9389 (gltA_77, pgm_205, tkt_743) has a similar profile to the 3rd strain, but no references with correct ST5845 were included.

Hi,

Thanks for the question! Like you wrote, it is probably an issue related to the reference sequences and their groupings.

In summary I think your approach for building the reference set sounds excellent but it could be that using > 100 alleles to define the MLSTs is too fine-grained for mSWEEP to untangle. With 100 alleles most of the strains will end up alone in their own ST although there might be very little variation between some of them. Based on your results, this indeed seems to be the case since the abundances for your 2nd strain are split between the 3 highly similar STs (ST4875 and ST3218 and ST6428) and something similar is also going on with the ST21 strain.

One potential solution could be to use PopPUNK (https://github.com/johnlees/PopPUNK) to group your STs into larger units. We have noticed in practice that the groups from PopPUNK tend to work quite well with mSWEEP and usually have meaningful biological interpretations that roughly correspond to clonal complexes but with the advantage that PopPUNK is able to place STs which don't have a defined CC. This is also the approach I would take now if I were to rewrite the mSWEEP paper from scratch :)

Also, there has been some further work in developing tools to assess the suitability of a set of reference genomes + groupings to a particular set of samples with running mSWEEP/mGEMS as the ultimate goal. demix_check (https://github.com/harry-thorpe/demix_check) works really well for solving this problem by identifying whether the mixed samples contain strains that originate from an unknown lineage that we do not have any reference sequences for in the mSWEEP database.

Hope this helps! I've answered your specific questions in more detail below. Please feel free to continue the discussion if you run into more issues.

Q1

You're right in that main problem is balancing the (dis)similarity of the sequences and the grouping. In my experience it is very difficult to select representative sequences for the groups, and it is almost always better to include more (possibly redundant) reference sequences for the group rather than to filter them down. Both the mSWEEP and mGEMS papers were written with this philosophy, meaning that we included all available sequences as the references and designed the methods to handle this case.

If the sample has 2 strains from the same CC/ST then the abundances for those will be merged and they'll be treated as the same. To get around this, you need to define the grouping at a level which separates the strains that you are interested in but does not quite go down to the level of individual sequences since those start to become increasingly difficult to identify.

Q2

Your case (abundances split into nearby clusters) is typical of what happens when there is no good match for the strain in the reference. The demix_check tool I mentioned should help in solving this problem.

Q3

Themisto can use disk space in constructing the alignment index, so it should be possible to index even very large reference sets. Pseudoaligning on large indexes should also be no problem if the sequences are somewhat similar (ie strains of the same species) since Themisto compresses the parts that are shared. Even when running on different species, it should be very efficient and I have not noticed any problems when running on < 32gb memory using 1000-10 000 reference sequences but it is hard to give any limit since the numbers will depend heavily on the composition of your reference.

Thanks again for the quick reply and the suggestions, the demix tool seems to be really useful.

I will download more references and try poppunk, I guess then there will be also some trial and error involved in decided on thresholds for clustering or can demix_check also be used for that conveniently? Previously, I was (or still am) using strainest (https://github.com/compmetagen/strainest) where I used mash distances to build a reference database. The advantage with using allelic profiles is only that I don't have to download all the sequences beforehand.

For the mSweep paper I have a bit of trouble understanding where you used which benchmarking data (only for campylobacter): For the first section single colony isolates from Yahara et al. (table S4), which you then used to make synthetic mixtures, and then you also tested mixed samples from literature (extended data, table S6)? Are the synthetic mixtures the files available on figshare and is there a table summarizing which isolates are contained in which mixture?

I don't think the demix_check tool can be used to automatically decide the clustering. It is more tailored for detecting whether the samples are adequately covered by the sequences in a reference, although it might be possible to use the tool manually for choosing a clustering out of some possible options.

In the mSWEEP paper some of the isolates from Yahara et al. were indeed used to make the synthetic mixtures. As for the "real" mixed samples, some of them had been published earlier in the literature (the samples in extended data table S6) and some were new samples that we made available alongside the paper (https://doi.org/10.6084/m9.figshare.6445136.v1 and https://doi.org/10.6084/m9.figshare.6445190.v1).

The synthetic mixtures are not available anywhere as far as I know, sorry. If you need to create similar samples for benchmarking purposes, the tool+code I used for mixing the reads is available in github (https://github.com/tmaklin/sweepsim). I'll have to have a closer look at the supplementary files to see if the list of isolates in each sample was included somewhere.

PROBIC / mSWEEP