We need to determine the best way to identify a reference species based on the assembled sequences (contigs.fa). Current ideas have touched upon MASH, but also Jspecies, etc. Still no solution.

This is a suggestion I got from the microbioinfo Slack people @b16joski and @boulund :

Adam Phillippy (the mash guy) also writes: FYI, We are releasing an improved ANI tool in a few weeks called “FastANI” that is Mash-like but specifically designed for ANI and more accurate/sensitive. Preprint coming soon... Also fast, as the name implies ;)

I just had a quick look at the pyani Github repo, and I like the pyani suggestion. I think it would work well for our purposes. It doesn't have a lot of extra dependencies that are hard to install either. If the MUMer-based approach (ANIm) works well in our case, I think that's the best route to aim for. Not entirely sure how well it will perform if we have a very large database of reference genomes to compare to... I guess we'll have to try and see!

It might still makes sense to compare to specI as well. According to their paper (of course) specI has slightly better accuracy in assignments than ANI using NUCmer (nucleotide alignment-part of MUMer; ran as part of JSpecies), and a lot better runtime performance (i.e. it's much faster).

Gonna test pyani on some reference genomes this week to see how it compares with species

Test of Mash screen

Ok, so I've made a bit of research here. This started out as me wanting to test Mash for screening (for issue #26) but ended up a bit larger than that... Mash for screening seems to work fairly well, but I think we can also tweak it to solve the species identification issue that we've been discussing to solve with pyani or something like that. Yesterday, I made a really quick test of pyani against all of refseq using a single set of H. pylori contigs, just because I wanted to see just how long it would take. I turns out it didn't manage to finish at all. It died sometime last night due to memory contraints. Huh, so I guess running pyani with nucmer against all of refseq is not an option.

Background

There's a nice writeup of the Mash approach to k-mer screening in a blog post: https://genomeinformatics.github.io/mash-screen/ I suggest you have a look if you haven't seen it already.

Method

This morning, I tested Mash screen a bit, intending to just make a quick test to see how it works. I used the screening method described on the Mash tutorial homepage, with their refseq.genomes.k21.s1000.msh that I downloaded to /db/refseq/ on CTMR-NAS. Then I ran

mash screen -w -p 8 /db/refseq/refseq.genomes.k21s1000.msh <path/to/sample_file.fastq.gz>

for my different sample files, described below.

I whipped together a small, quick'n'dirty Python script for assessing the output from mash screen, just to get a feel for if it works at all for our use case, and how much extra tooling we need to develop in order to get something useful out of it. The code is here: https://github.com/boulund/mash_scripts/blob/master/assess_mash_screen.py

Samples

I tried it with a Helicobacter pylori sample (G162-C, from an old project me and Kaisa worked on), both using the raw reads and the contigs. I also tested with 171013_M03284_0074_000000000-BBMHR/Unaligned-Y301I8I8Y301/Project_KaisaWGS171004/Sample_171013-P280/171013-P280_ATTACTCG-AGGCGAAG_L001_R{1,2}*gz from the INBOX on CTMR-NAS.

Results

When run on the G162-C contigs:

[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py screen.tab
The sample probably consist of only a single species: Helicobacter pylori

When run on the G162-C raw reads:

[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py reads_screen.tab
The sample probably consist of only a single species: Helicobacter pylori

When run on Sample_171013-P280:

[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py testwgs.tab
WARNING: The sample likely contains more than one species: Paraclostridium bifermentans, Candidatus Dorea, Clostridium sp., Paraclostridium benzoelyticum
0.990364        (816, 1000)     59      0.0     GCF_000498455.1_WYM_1.0_genomic.fna.gz      [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]
0.988838        (790, 1000)     59      0.0     GCF_000452225.2_ASM45222v2_genomic.fna.gz   [58 seqs] NZ_AVNB01000001.1 Paraclostridium bifermentans ATCC 19299 gcdATCC19299.contig.0, whole genome shotgun sequence [...]
0.97868 (636, 1000)     59      0.0     GCF_000764005.1_ASM76400v1_genomic.fna.gz  [56 seqs] NZ_JQHY01000001.1 Clostridium sp. NCR contig00001, whole genome shotgun sequence [...]
0.977343        (618, 1000)     59      0.0     GCF_001282705.1_Dorea_massiliensis_genomic.fna.gz   [36 seqs] NZ_LN876588.1 Candidatus Dorea massiliensis AP6 genome assembly Dorea massiliensis, scaffold scaffold00001, whole genome shotgun sequence [...]
0.970481        (533, 1000)     60      0.0     GCF_000452245.2_ASM45224v2_genomic.fna.gz   [22 seqs] NZ_AVNC01000001.1 Paraclostridium bifermentans ATCC 638 gcdATCC638.contig.0, whole genome shotgun sequence [...]
0.961242        (436, 1000)     59      0.0     GCF_001006285.1_ASM100628v1_genomic.fna.gz  [368 seqs] NZ_LBBT01000001.1 Paraclostridium benzoelyticum strain JC272 Contig1, whole genome shotgun sequence [...]

Conclusions

So, from this I conclude that it's probably feasible to implement mash screen as a quick way to screen for contaminated samples, but I'm not entirely sure how we should inform the user of this, and how to handle the different scenarios that can occur.

I would like to investigate further how accurate it is on a couple of samples that we've run before, so we can determine good cutoffs for the different parameters. Currently the only features I use to assess if a sample consists of multiple species is identity and shared_hashes. Currently, I set min_identity = 0.85 and min_shared_hashes_threshold to 10% of the number of shared_hashes of the best match (i..e with the highest identity of all matches for this sample).

I think we could use mash screen as a way to get a usable indication of what species, and thus what reference genome to use for genome annotation downstream, after assembly. I see two cases:

We only detect a single species in the sample using mash screen. Then we use mash screen both as an initial screen to detect potential contamination of samples, but also as a way to identify what species is in the sample. If we only detect one species, great! We can now send that information into a separate processing channel in Nextflow so that a suitable reference genome (and its annotation) can be downloaded so it's ready for use when the genome assembly is completed (parallelization, yeay!).
We detect multiple species in a sample using mash screen. Then we use the output from mash screen to download all the potential reference genomes that were identified (maybe limit to the top 10 best hits or something), and then run a separate pyani-step after assembly, before genome annotation, to pick out the best match against the few genomes we compare against.

What do you think? Any comments?

Great work, really nice to see how things work out.

As for the identification of contaminated samples I think that the pipeline should just abort with an error message/printing of an error log file if there is more than one species detected. And if the user absolutely wants to assemble the genomes anyway, this has to be done manually? Alternatively that there could be some kind of -force option that tells the pipeline to ignore that error. But that might be a bit overkill because to make any sense of the output the contigs of the different genomes needs to be classified/mapped and separated before annotation anyway and there are a lot of potential complications with multi-species assemblies so the usefulness of this is anyway a bit unclear.

Do you think the mash screen could replace pyani entirely or should we still use both steps? The cases where we would like to download and use a reference genome is when there is a complete reference or representative genome for that species, at least that is what we used in Sandras version. It needs to be complete to be useful in Mauve and if it is not I think the usefulness for annotation purposes is probably limited too...

Best wishes,

Kaisa

I think mash could replace the function we would like for

On 22 Nov 2017, at 15:27, Fredrik Boulund notifications@github.com wrote:

Test of Mash screen

Ok, so I've made a bit of research here. This started out as me wanting to test Mash for screen (for issue #26 https://github.com/ctmrbio/BACTpipe/issues/26) but ended up a bit larger than that... Mash for screening seems to work fairly well, but I think we can also tweak it to solve the species identification issue that we've been discussing to solve with pyani or something like that. Yesterday, I made a really quick test of pyani against all of refseq using a single set of H. pylori contigs, just because I wanted to see just how long it would take. I turns out it didn't manage to finish at all. I died sometime last night due to memory contraints. Huh, so I guess running pyani with nucmer is not an option.

Background

There's a nice writeup of the Mash approach to k-mer screening in a blog post: https://genomeinformatics.github.io/mash-screen/ https://genomeinformatics.github.io/mash-screen/ I suggest you have a look if you haven't seen it already.

Method

This morning, I tested Mash screen a bit, intending to just make a quick test to see how it works. I used the screening method described on the Mash tutorial homepage http://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes, with their refseq.genomes.k21.s1000.msh https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh that I downloaded to /db/refseq/ on CTMR-NAS. Then I ran

mash screen -w -p 8 /db/refseq/refseq.genomes.k21s1000.msh <path/to/sample_file.fastq.gz> for my different sample files, described below.

I whipped together a small, quick'n'dirty Python script for assessing the output from mash screen, just to get a feel for if it works at all for our use case, and how much extra tooling we need to develop in order to get something useful out of it. The code is here: https://github.com/boulund/mash_scripts/blob/master/assess_mash_screen.py https://github.com/boulund/mash_scripts/blob/master/assess_mash_screen.py Samples

I tried it with a Helicobacter pylori sample (G162-C, from an old project me and Kaisa worked on), both using the raw reads and the contigs. I also tested with 171013_M03284_0074_000000000-BBMHR/Unaligned-Y301I8I8Y301/Project_KaisaWGS171004/Sample_171013-P280/171013-P280_ATTACTCG-AGGCGAAG_L001_R{1,2}*gz from the INBOX on CTMR-NAS.

Results

When run on the G162-C contigs:

[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py screen.tab The sample probably consist of only a single species: Helicobacter pylori When run on the G162-C raw reads:

[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py reads_screen.tab The sample probably consist of only a single species: Helicobacter pylori When run on Sample_171013-P280:

[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py testwgs.tab WARNING: The sample likely contains more than one species: Paraclostridium bifermentans, Candidatus Dorea, Clostridium sp., Paraclostridium benzoelyticum 0.990364 (816, 1000) 59 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.988838 (790, 1000) 59 0.0 GCF_000452225.2_ASM45222v2_genomic.fna.gz [58 seqs] NZ_AVNB01000001.1 Paraclostridium bifermentans ATCC 19299 gcdATCC19299.contig.0, whole genome shotgun sequence [...] 0.97868 (636, 1000) 59 0.0 GCF_000764005.1_ASM76400v1_genomic.fna.gz [56 seqs] NZ_JQHY01000001.1 Clostridium sp. NCR contig00001, whole genome shotgun sequence [...] 0.977343 (618, 1000) 59 0.0 GCF_001282705.1_Dorea_massiliensis_genomic.fna.gz [36 seqs] NZ_LN876588.1 Candidatus Dorea massiliensis AP6 genome assembly Dorea massiliensis, scaffold scaffold00001, whole genome shotgun sequence [...] 0.970481 (533, 1000) 60 0.0 GCF_000452245.2_ASM45224v2_genomic.fna.gz [22 seqs] NZ_AVNC01000001.1 Paraclostridium bifermentans ATCC 638 gcdATCC638.contig.0, whole genome shotgun sequence [...] 0.961242 (436, 1000) 59 0.0 GCF_001006285.1_ASM100628v1_genomic.fna.gz [368 seqs] NZ_LBBT01000001.1 Paraclostridium benzoelyticum strain JC272 Contig1, whole genome shotgun sequence [...] Conclusions

So, from this I conclude that it's probably feasible to implement mash screen as a quick way to screen for contaminated samples, but I'm not entirely sure how we should inform the user of this, and how to handle the different scenarios that can occur.

I would like to investigate further how accurate it is on a couple of samples that we've run before, so we can determine good cutoffs for the different parameters. Currently the only features I use to assess if a sample consists of multiple species is identity and shared_hashes. Currently, I set min_identity = 0.85 and min_shared_hashes_threshold to 10% of the number of shared_hashes of the best match (i..e with the highest identity of all matches for this sample).

I think we could use mash screen as a way to get a usable indication of what species, and thus what reference genome to use for genome annotation downstream, after assembly. I see two cases:

We only detect a single species in the sample using mash screen. Then we use mash screen both as an initial screen to detect potential contamination of samples, but also as a way to identify what species is in the sample. If we only detect one species, great! We can now send that information into a separate processing channel in Nextflow so that a suitable reference genome (and its annotation) can be downloaded so it's ready for use when the genome assembly is completed (parallelization, yeay!). We detect multiple species in a sample using mash screen. Then we use the output from mash screen to download all the potential reference genomes that were identified (maybe limit to the top 10 best hits or something), and then run a separate pyani-step after assembly, before genome annotation, to pick out the best match against the few genomes we compare against. What do you think? Any comments?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ctmrbio/BACTpipe/issues/25#issuecomment-346365412, or mute the thread https://github.com/notifications/unsubscribe-auth/AMfZmGq5MHc31YzLBHU-gLO3n3EW_9l9ks5s5C9lgaJpZM4QMphL.

Nice that you had time to look at this @thorellk :)

I agree with you that it's a bit strange to continue the workflow if the screening step detects more than one species in the sample. However, it could be nice to have it continue e.g. to just until the assembly step, to make it easier if the user would want to continue manually working with the data from there. Another suitable step other than assembly could also be used as a stop in these cases (maybe just QC, trimming, etc.?).

We still need to test Mash a bit more on some samples that we know the contents of, to at least get a better sense of how it performs for the most common organisms. Like in the example I tried above, with Sample_171013-P280, showed that the sample might contain multiple species, but I'm not entirely convinced of the output. As far as I can tell, the suggested "different" species could very well be the same species. Do you remember what it contained? Using mash screen is probably not going to be super robust in these situations, so we need to get a better feel for what thresholds to use.

If we manage to test Mash on a larger set of samples, and it performs well, I definitely think it can replace our plans with a separate pyani-step after assembly. It's much cleaner than using pyani, with less dependencies and smaller reference database. I would really like it if this worked out :)

That sample (ACHIM-55) is actually, according to SILVA, only containing rRNA from one species: Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Paraclostridium;

jspecies TCS says Clostridium sp. NCR (Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae) but it always only gives one hit so it wouldn’t detect contamination anyway.

So it could be that mash in this case is overdoing it. It is also one of those bacterial taxa that is really messy (several Clostridia have been reclassified to Paraclostridium etc) so it is not one of the easiest to start with.

But that pinpoints the challenges with the species determination quite nicely on the other hand…

I agree with you that performing trimming and QC might be good regardless but to not proceed beyond that point if it is suspected to be contaminated.

BW,

Kaisa

On 23 Nov 2017, at 16:35, Fredrik Boulund notifications@github.com wrote:

Nice that you had time to look at this @thorellk :)

I agree with you that it's a bit strange to continue the workflow if the screening step detects more than one species in the sample. However, it could be nice to have it continue e.g. to just until the assembly step, to make it easier if the user would want to continue manually working with the data from there. Another suitable step other than assembly could also be used as a stop in these cases (maybe just QC, trimming, etc.?).

We still need to test Mash a bit more on some samples that we know the contents of, to at least get a better sense of how it performs for the most common organisms. Like in the example I tried above, with Sample_171013-P280, showed that the sample might contain multiple species, but I'm not entirely convinced of the output. As far as I can tell, the suggested "different" species could very well be the same species. Do you remember what it contained? Using mash screen is probably not going to be super robust in these situations, so we need to get a better feel for what thresholds to use.

If we manage to test Mash on a larger set of samples, and it performs well, I definitely think it can replace our plans with a separate pyani-step after assembly. It's much cleaner than using pyani, with less dependencies and smaller reference database. I would really like it if this worked out :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Test of mash screen with known genomes

Background

Diffrent sample types including H.pylori 26695 and 2 Kalix strains, ACHIM and Ecoli raw fastq.gz files were screened in mash in order to rapidly estimate resemblance with the mash screen database of ref genomes.

Method

I downloaded the mash ref Seq database refseq.genomes.k21.s1000.msh /home/josephk/joseph/nextflow_bactpipe/mash_screen/refseqsketches.msh Created a folder of test raw reads /home/josephk/joseph/nextflow_bactpipe/mash_screen/test_genomes Created a simple bash script with mash screen command

#!/usr/bin/env bash
seq='/home/josephk/joseph/nextflow_bactpipe/mash_screen/test_genomes/*_R1.fastq.gz'
ref='/home/josephk/joseph/nextflow_bactpipe/mash_screen/refseqsketches.msh/refseq.genomes*'
echo $seq
for file1 in $seq
do
        echo 'starting mash screen'
        name=$(basename $file1)
        echo $name
        file2=$(echo $file1 | sed 's/\_R1/\_R2/')
        mash screen -w -p 8 $ref $file1 $file2 > $name.tab
        sort -gr $name.tab | head -5
        python ./assses_mash_screen.py $name.tab -o $name.txt
done

Results

For H.pylori 26695

Writing output...
1   1000/1000   191 0   GCF_000008525.1_ASM852v1_genomic.fna.gz NC_000915.1 Helicobacter pylori 26695 chromosome, complete genome
0.88561 78/1000 1   3.40572e-243    GCF_000470135.1_HP87hu_genomic.fna.gz   [72 seqs] NZ_CBRI010000001.1 Helicobacter pylori HP87hu WGS project CBRI000000000 data, contig 00001, whole genome shotgun sequence [...]
0.873406    38/652  7   2.10293e-114    GCF_000910735.1_ViralProj214366_genomic.fna.gz  NC_021929.1 Malvastrum leaf curl Philippines betasatellite, complete sequence
0.858899    41/1000 2   6.9149e-117 GCF_000824885.1_H3016_genomic.fna.gz    [101 seqs] NZ_CCMT01000002.1 Helicobacter pylori genome assembly H3016, contig H3016_001, whole genome shotgun sequence [...]
0.856856    39/1000 1   2.09924e-110    GCF_000287295.1_ASM28729v1_genomic.fna.gz   NC_018417.1 Candidatus Carsonella ruddii HT isolate Thao2000, complete genome

For ACHIM_22-1

Writing output...
0.990652    821/1000    81  0   GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]
0.985891    742/1000    24  0   GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz  [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...]
0.889242    85/1000 23  2.54906e-296    GCF_001481725.1_ASM148172v1_genomic.fna.gz  NZ_CP011914.1 Eubacterium limosum strain SA11, complete genome
0.887203    81/1000 83  1.07846e-280    GCF_000452225.2_ASM45222v2_genomic.fna.gz   [58 seqs] NZ_AVNB01000001.1 Paraclostridium bifermentans ATCC 19299 gcdATCC19299.contig.0, whole genome shotgun sequence [...]
0.874487    39/652  3   9.64658e-131    GCF_000910735.1_ViralProj214366_genomic.fna.gz  NC_021929.1 Malvastrum leaf curl Philippines betasatellite, complete sequence

For ACHIM_51

Writing output...
0.998941    978/1000    107 0   GCF_000712925.1_psaLyso211_genomic.fna.gz   [49 seqs] NZ_JOPV01000001.1 Staphylococcus warneri Lyso 2 2011 psaLyso211.contig.0, whole genome shotgun sequence [...]
0.852771    23/652  2   6.06497e-80 GCF_000910735.1_ViralProj214366_genomic.fna.gz  NC_021929.1 Malvastrum leaf curl Philippines betasatellite, complete sequence
0.834744    15/666  13  1.04014e-49 GCF_000922435.1_ViralProj259986_genomic.fna.gz  NC_024777.1 Small begomovirus-associated satellite isolate Sa19-S1, complete sequence
0.831966    21/1000 4   2.45625e-68 GCF_000920515.2_ViralProj243492_genomic.fna.gz  NC_023876.2 Andrographis yellow vein leaf curl betasatellite clone bt-2, complete sequence
0.828011    19/1000 4   3.83459e-61 GCF_001429995.1_ViralProj300248_genomic.fna.gz  NC_028116.1 Ageratum conyzoides symptomless alphasatellite isolate WOK80, complete sequence

For H.pylori Kx-1090A

0.974793    585/1000    148 0   GCF_000498335.1_ASM49833v1_genomic.fna.gz   NC_022911.1 Helicobacter pylori BM012S, complete genome
0.92688 203/1000    144 0   GCF_000824885.1_H3016_genomic.fna.gz    [101 seqs] NZ_CCMT01000002.1 Helicobacter pylori genome assembly H3016, contig H3016_001, whole genome shotgun sequence [...]
0.891666    90/1000 135 0   GCF_000013245.1_ASM1324v1_genomic.fna.gz    [2 seqs] NC_008086.1 Helicobacter pylori HPAG1, complete genome [...]
0.874615    60/1000 135 9.96005e-210    GCF_000299835.1_ASM29983v1_genomic.fna.gz   [10 seqs] NZ_AMOS01000001.1 Helicobacter pylori R32b HPR32b.contig.0, whole genome shotgun sequence [...]
0.867054    50/1000 142 5.34247e-171    GCF_000824865.1_H3014_genomic.fna.gz    [51 seqs] NZ_CCMU01000002.1 Helicobacter pylori genome assembly H3014, contig H3014_001, whole genome shotgun sequence [...]

For H.pylori Kx-96A

0.972684    559/1000    113 0   GCF_000299835.1_ASM29983v1_genomic.fna.gz   [10 seqs] NZ_AMOS01000001.1 Helicobacter pylori R32b HPR32b.contig.0, whole genome shotgun sequence [...]
0.91934 171/1000    109 0   GCF_900087475.1_TC-1_1_genomic.fna.gz   [78 seqs] NZ_FLOM01000001.1 Helicobacter pylori isolate TC-1_1, whole genome shotgun sequence [...]
0.911545    143/1000    154 0   GCF_000894175.1_ViralProj80923_genomic.fna.gz   NC_016568.1 Helicobacter phage phiHP33, complete genome
0.880455    69/1000 109 2.82977e-253    GCF_000275125.1_ASM27512v1_genomic.fna.gz   [16 seqs] NZ_AKOI01000001.1 Helicobacter pylori Hp H-28 HpH_28.contig.0_1, whole genome shotgun sequence [...]
0.875981    62/1000 108 8.72094e-225    GCF_000013245.1_ASM1324v1_genomic.fna.gz    [2 seqs] NC_008086.1 Helicobacter pylori HPAG1, complete genome [...]

For LE-Ecoli

Writing output...
0.998649    972/1000    62  0   GCF_000025745.1_ASM2574v1_genomic.fna.gz    NC_017628.1 Escherichia coli IHE3034, complete genome
0.989958    809/1000    58  0   GCF_000846325.1_ViralProj14414_genomic.fna.gz   NC_001609.1 Enterobacteria phage P4, complete genome
0.964685    470/1000    61  0   GCF_000839125.1_ViralProj14162_genomic.fna.gz   NC_003444.1 Enterobacteria phage SfV, complete genome
0.964095    464/1000    59  0   GCF_000836905.1_ViralProj14035_genomic.fna.gz   NC_001895.1 Enterobacteria phage P2, complete genome
0.958543    411/1000    60  0   GCF_000930115.1_ViralProj271780_genomic.fna.gz  NC_026014.1 Enterobacteria phage P88, complete genome

Conclusion

For the ACHIM strains, ACHIM_22-1 was initially a mixture from silva screening and ACHIM_51 was a pure isolate. From the above results, it is clear that we could use mash screen to have an idea of what species we have in our query genomes but also what reference genome we can use for annotation genomes using prokka.

N.B. Sorry I am going to go through the Markdown tutorial. I am just starting to use it. Pardon me please.

Nice job! 👍 Let's discuss this in more detail in our meeting in 10 minutes :).

This is really interesting! I would love to see what output my simple "classification script", assess_mash_screen.py, gives on the different output files you produced (mashscreengenomes.tab). If it gives accurate results it means we can start implementing mash screen for both contaminant and species identification in BACTpipe :)

I just thought about a way to possibly improve the specificity of species identification. However, it might come with some ceaveats that I haven't figured out yet, so please consider possible errors in my reasoning. I'm just toying with some ideas here, so feel free to correct me if I make any wrong assumptions.

From Mash we get a percent identity, but also the proportion of kmers that matches (which is used internally in Mash to compute the percent identity). Combining these values in a clever way might improve our assessment of the species in a sample, e.g. maybe even something simple like percent_identity * (shared_kmers / total_kmers) combined with a double threshold system where we have a minimum threshold for any kind of species prediction, and a second threshold that is dynamically computed depending on the top ranking hit (similar to what we do now) to discard similar-but-not-similar-enough reference sequences. Trying this on Sample_171013-P280 would give the following results:

percent_identiy	shared_hashes	total_hashes	species	classification_score
0.990364	816	1000	bifermentans	0.808137
0.988838	790	1000	bifermentans	0.781182
0.97868	636	1000	clostridium	0.62244
0.977434	618	1000	candidatus dorea	0.604054
0.970481	533	1000	bifermentans	0.517266
0.961242	436	1000	benzoelyticum	0.419102

If we have a minimum first threshold of classification at a score of say 0.75 (this is not a percent identity-score any more), and then use a second dynamic threshold of maximum 0.15 from the top hit, we get a pure classification of bifermentans for this sample.

I'm unsure of how to best make use of the multiplicity value (the third column in the mash screen output), maybe that can also be incorporated somehow.

I'll implement the most recent classification method based on classification_score as described in my previous post. Then @b16joski can rerun mash screen on all samples using the new classification method so we can see how it performs.

Test of Mashscreen with ACHIM genomes

All ACHIM whole genomes sequenced so far were screened with mash to classify both mixtures and pure isolates. This would provide a hint of what species we most likely have in query samples.

Raw ACHIM reads to be screened with mash

Here is the directory of query Achim genomes to screen with mash /home/joseph.kirangwa/ACHIM

ACHIM_24-1_R1.fastq.gz  ACHIM_27-1_R2.fastq.gz  ACHIM_35-3_R1.fastq.gz  ACHIM_40-3_R2.fastq.gz  ACHIM_47-5_R1.fastq.gz  ACHIM_59_R2.fastq.gz
ACHIM_22-1_R1.fastq.gz  ACHIM_24-1_R2.fastq.gz  ACHIM_27-2_R1.fastq.gz  ACHIM_35-3_R2.fastq.gz  ACHIM_40-4_R1.fastq.gz  ACHIM_47-5_R2.fastq.gz  ACHIM_60_R1.fastq.gz
ACHIM_22-1_R2.fastq.gz  ACHIM_24-2_R1.fastq.gz  ACHIM_27-2_R2.fastq.gz  ACHIM_35-4_R1.fastq.gz  ACHIM_40-4_R2.fastq.gz  ACHIM_48-1_R1.fastq.gz  ACHIM_60_R2.fastq.gz
ACHIM_22-2_R1.fastq.gz  ACHIM_24-2_R2.fastq.gz  ACHIM_27-3_R1.fastq.gz  ACHIM_35-4_R2.fastq.gz  ACHIM_40-5_R1.fastq.gz  ACHIM_48-1_R2.fastq.gz  ACHIM_61_R1.fastq.gz
ACHIM_22-2_R2.fastq.gz  ACHIM_24-3_R1.fastq.gz  ACHIM_27-3_R2.fastq.gz  ACHIM_35-5_R1.fastq.gz  ACHIM_40-5_R2.fastq.gz  ACHIM_51_R1.fastq.gz    ACHIM_61_R2.fastq.gz
ACHIM_22-3_R1.fastq.gz  ACHIM_24-3_R2.fastq.gz  ACHIM_27-4_R1.fastq.gz  ACHIM_35-5_R2.fastq.gz  ACHIM_44-1_R1.fastq.gz  ACHIM_51_R2.fastq.gz    ACHIM_62_R1.fastq.gz
ACHIM_22-3_R2.fastq.gz  ACHIM_24-4_R1.fastq.gz  ACHIM_27-4_R2.fastq.gz  ACHIM_38-1_R1.fastq.gz  ACHIM_44-1_R2.fastq.gz  ACHIM_52_R1.fastq.gz    ACHIM_62_R2.fastq.gz
ACHIM_22-4_R1.fastq.gz  ACHIM_24-4_R2.fastq.gz  ACHIM_27-5_R1.fastq.gz  ACHIM_38-1_R2.fastq.gz  ACHIM_44-2_R1.fastq.gz  ACHIM_52_R2.fastq.gz    ACHIM_63_R1.fastq.gz
ACHIM_22-4_R2.fastq.gz  ACHIM_24-5_R1.fastq.gz  ACHIM_27-5_R2.fastq.gz  ACHIM_38-2_R1.fastq.gz  ACHIM_44-2_R2.fastq.gz  ACHIM_53_R1.fastq.gz    ACHIM_63_R2.fastq.gz
ACHIM_22-5_R1.fastq.gz  ACHIM_24-5_R2.fastq.gz  ACHIM_33-1_R1.fastq.gz  ACHIM_38-2_R2.fastq.gz  ACHIM_44-3_R1.fastq.gz  ACHIM_53_R2.fastq.gz
ACHIM_22-5_R2.fastq.gz  ACHIM_26-1_R1.fastq.gz  ACHIM_33-1_R2.fastq.gz  ACHIM_38-3_R1.fastq.gz  ACHIM_44-3_R2.fastq.gz  ACHIM_54_R1.fastq.gz
ACHIM_23-1_R1.fastq.gz  ACHIM_26-1_R2.fastq.gz  ACHIM_33-2_R1.fastq.gz  ACHIM_38-3_R2.fastq.gz  ACHIM_44-5_R1.fastq.gz  ACHIM_54_R2.fastq.gz    
ACHIM_23-1_R2.fastq.gz  ACHIM_26-2_R1.fastq.gz  ACHIM_33-2_R2.fastq.gz  ACHIM_38-4_R1.fastq.gz  ACHIM_44-5_R2.fastq.gz  ACHIM_55_R1.fastq.gz
ACHIM_23-2_R1.fastq.gz  ACHIM_26-2_R2.fastq.gz  ACHIM_33-3_R1.fastq.gz  ACHIM_38-4_R2.fastq.gz  ACHIM_47-1_R1.fastq.gz  ACHIM_55_R2.fastq.gz
ACHIM_23-2_R2.fastq.gz  ACHIM_26-3_R1.fastq.gz  ACHIM_33-3_R2.fastq.gz  ACHIM_38-5_R1.fastq.gz  ACHIM_47-1_R2.fastq.gz  ACHIM_56_R1.fastq.gz
ACHIM_23-3_R1.fastq.gz  ACHIM_26-3_R2.fastq.gz  ACHIM_33-5_R1.fastq.gz  ACHIM_38-5_R2.fastq.gz  ACHIM_47-2_R1.fastq.gz  ACHIM_56_R2.fastq.gz
ACHIM_23-3_R2.fastq.gz  ACHIM_26-4_R1.fastq.gz  ACHIM_33-5_R2.fastq.gz  ACHIM_40-1_R1.fastq.gz  ACHIM_47-2_R2.fastq.gz  ACHIM_57_R1.fastq.gz
ACHIM_23-4_R1.fastq.gz  ACHIM_26-4_R2.fastq.gz  ACHIM_35-1_R1.fastq.gz  ACHIM_40-1_R2.fastq.gz  ACHIM_47-3_R1.fastq.gz  ACHIM_57_R2.fastq.gz
ACHIM_23-4_R2.fastq.gz  ACHIM_26-5_R1.fastq.gz  ACHIM_35-1_R2.fastq.gz  ACHIM_40-2_R1.fastq.gz  ACHIM_47-3_R2.fastq.gz  ACHIM_58_R1.fastq.gz
ACHIM_23-5_R1.fastq.gz  ACHIM_26-5_R2.fastq.gz  ACHIM_35-2_R1.fastq.gz  ACHIM_40-2_R2.fastq.gz  ACHIM_47-4_R1.fastq.gz  ACHIM_58_R2.fastq.gz
ACHIM_23-5_R2.fastq.gz  ACHIM_27-1_R1.fastq.gz  ACHIM_35-2_R2.fastq.gz  ACHIM_40-3_R1.fastq.gz  ACHIM_47-4_R2.fastq.gz  ACHIM_59_R1.fastq.gz

code

/home/joseph.kirangwa/ACHIM/mash.nextflow.nf

#!/usr/bin/env nextflow

//Definition of default parameters

params.reads = '*_R{1,2}.fastq.gz'
params.database = "/db/refseq/*.msh"

//Parsing the input

ref_database = file( params.database )

//validate input file

if( !ref_database.exists() ) exit 1, "Missing reference database file: ${ref_database}"

//define input channel for paired end reads

Channel
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .set { read_pairs }

//define mash proces

process mash {
    cpus 8
    tag { pair_id }
    publishDir "./mash.screen", mode: 'copy'

    input:
    set pair_id, file(reads) from read_pairs

    output:
    set pair_id, file("${pair_id}.reads.csv") into next_ch

    script:

    """
    mash screen -w -p ${task.cpus} ${ref_database[0]} ${reads[0]} ${reads[1]} > ${pair_id}.reads.csv
    """

 }

//sort and classify mash screen outputs using `assess_mash_screen.py`

process mashsort {
    validExitStatus 0,1,2
    tag {sample_id}
    publishDir "./mash.screen_sorted", mode: 'copy'

    input:
    set sample_id, file(achim) from next_ch

    output:
    file("${sample_id}.mashscreen.tab")

    script:

    """
    sort -gr ${achim} | head -5 > ${sample_id}.achim.sorted.tab
    python /home/joseph.kirangwa/mash-script/mash_scripts/assess_mash_screen.py ${sample_id}.achim.sorted.tab -o ${sample_id}.mashscreen.tab
    """

 }

workflow.onComplete {
        println ( workflow.success ? "Done!" : "Oops .. something went wrong" )
}

Code execution

(base) [joseph.kirangwa@ctmr-nas ACHIM]$ nextflow run mash.nextflow.nf

Results

WARNING: Sample 'ACHIM_22-1' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.990652    (821, 1000) 81  0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.8133252919999999
0.985891    (742, 1000) 24  0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz  [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...]   0.7315311219999999
Sample 'ACHIM_22-2' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_22-3' probably consist of only a single species: Paraclostridium bifermentans
WARNING: Sample 'ACHIM_22-4' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.990422    (817, 1000) 38  0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.8091747739999999
0.985573    (737, 1000) 9   0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz  [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...]   0.726367301
WARNING: Sample 'ACHIM_22-5' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.990479    (818, 1000) 38  0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.810211822
0.9857  (739, 1000) 10  0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz  [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...]   0.7284323
WARNING: Sample 'ACHIM_23-1' likely contains more than one species: Bacteroides uniformis, Paraclostridium bifermentans
0.994092    (883, 1000) 64  0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz   [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...]   0.877783236
0.988719    (788, 1000) 5   0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.779110572
Sample 'ACHIM_23-2' probably consist of only a single species: Bacteroides uniformis
Sample 'ACHIM_23-3' probably consist of only a single species: Bacteroides uniformis
Sample 'ACHIM_23-4' probably consist of only a single species: Bacteroides sp.
WARNING: Sample 'ACHIM_23-5' likely contains more than one species: Paraclostridium bifermentans, Bacteroides sp.
0.986582    (753, 1000) 3   0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.7428962459999999
0.985764    (740, 1000) 41  0.0 GCF_000218365.1_Bacteroides_sp_1_1_30_V1_genomic.fna.gz [73 seqs] NZ_GL945090.1 Bacteroides sp. 1_1_30 genomic scaffold supercont1.1, whole genome shotgun sequence [...]   0.72946536
Sample 'ACHIM_24-1' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_24-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_24-3' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_24-4' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_24-5' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_26-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_27-1' probably consist of only a single species: Eubacterium limosum
WARNING: Sample 'ACHIM_27-2' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.989016    (793, 1000) 6   0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.7842896880000001
0.9857  (739, 1000) 19  0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz  [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...]   0.7284323
Sample 'ACHIM_27-3' probably consist of only a single species: Eubacterium limosum
Sample 'ACHIM_27-4' probably consist of only a single species: Eubacterium limosum
Sample 'ACHIM_27-5' probably consist of only a single species: Eubacterium limosum
Sample 'ACHIM_33-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_33-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_33-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_33-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_38-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_38-2' probably consist of only a single species: Prevotella phocaeensis
WARNING: Sample 'ACHIM_38-3' likely contains more than one species: Prevotella phocaeensis, Bacteroides uniformis
0.999135    (982, 1000) 21  0.0 GCF_900065875.1_PRJEB126981_genomic.fna.gz  [12 seqs] NZ_LT160615.1 Prevotella phocaeensis strain SN19, whole genome shotgun sequence [...] 0.9811505699999999
0.994039    (882, 1000) 52  0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz   [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...]   0.876742398
Sample 'ACHIM_38-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_38-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-5' probably consist of only a single species: Prevotella phocaeensis
WARNING: Sample 'ACHIM_44-1' likely contains more than one species: Clostridium perfringens, Paraclostridium bifermentans
0.992346    (851, 1000) 84  0.0 GCF_000512415.1_JJC_1.0_genomic.fna.gz  [69 seqs] NZ_AWRZ01000001.1 Clostridium perfringens JJC contig1, whole genome shotgun sequence [...]    0.8444864459999999
0.989899    (808, 1000) 7   0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.7998383920000001
WARNING: Sample 'ACHIM_44-2' likely contains more than one species: Paraclostridium bifermentans, Clostridium perfringens
0.992346    (851, 1000) 140 0.0 GCF_000512415.1_JJC_1.0_genomic.fna.gz  [69 seqs] NZ_AWRZ01000001.1 Clostridium perfringens JJC contig1, whole genome shotgun sequence [...]    0.8444864459999999
0.990306    (815, 1000) 22  0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.80709939
Sample 'ACHIM_44-3' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_44-5' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_47-1' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_47-2' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_47-3' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_47-4' probably consist of only a single species: Clostridium perfringens
WARNING: Sample 'ACHIM_47-5' likely contains more than one species: Clostridium perfringens, Paraclostridium bifermentans
0.992235    (849, 1000) 69  0.0 GCF_000512415.1_JJC_1.0_genomic.fna.gz  [69 seqs] NZ_AWRZ01000001.1 Clostridium perfringens JJC contig1, whole genome shotgun sequence [...]    0.842407515
0.989782    (806, 1000) 6   0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.7977642920000001
Sample 'ACHIM_48-1' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-2' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-3' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-4' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-5' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_49-1' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_49-2' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_51' probably consist of only a single species: Staphylococcus warneri
Sample 'ACHIM_52' probably consist of only a single species: Paraclostridium bifermentans
WARNING: Sample 'ACHIM_53' likely contains more than one species: Staphylococcus warneri, Bacteroides uniformis
0.994092    (883, 1000) 102 0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz   [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...]   0.877783236
0.989489    (801, 1000) 2   0.0 GCF_000712925.1_psaLyso211_genomic.fna.gz   [49 seqs] NZ_JOPV01000001.1 Staphylococcus warneri Lyso 2 2011 psaLyso211.contig.0, whole genome shotgun sequence [...] 0.792580689
Sample 'ACHIM_54' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_55' probably consist of only a single species: Paraclostridium bifermentans
WARNING: Sample 'ACHIM_56' likely contains more than one species: Bacteroides uniformis, Paraclostridium bifermentans
0.993985    (881, 1000) 9   0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz   [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...]   0.875700785
0.990306    (815, 1000) 51  0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz  [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]  0.80709939
Sample 'ACHIM_57' probably consist of only a single species: Clostridium sp.
Sample 'ACHIM_58' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_59' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_60' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_61' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_62' probably consist of only a single species: Clostridium sp.
Sample 'ACHIM_63' probably consist of only a single species: Clostridium sporogenes

remark

Looks like we have some false positive probable single species outputs for example Achim_22-2, Achim_22-3, ACHIM_59 when I compare with previous silver classification. We could probably maximise on specificity.

Results from using default classification score 0.20

ACHIM_22-1  FAIL    Eubacterium limosum, Paraclostridium bifermentans
ACHIM_22-2  FAIL    Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-3  FAIL    Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-4  FAIL    Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-5  FAIL    Eubacterium limosum, Paraclostridium bifermentans
ACHIM_23-1  FAIL    Bacteroides uniformis, Paraclostridium bifermentans
ACHIM_23-2  PASS    Bacteroides uniformis
ACHIM_23-3  PASS    Bacteroides uniformis
ACHIM_23-4  PASS    Bacteroides sp.
ACHIM_23-5  FAIL    Bacteroides sp., Paraclostridium bifermentans
ACHIM_24-1  PASS    Paraclostridium bifermentans
ACHIM_24-2  FAIL    Prevotella phocaeensis, Paraclostridium bifermentans
ACHIM_24-3  PASS    Paraclostridium bifermentans
ACHIM_24-4  PASS    Paraclostridium bifermentans
ACHIM_24-5  PASS    Paraclostridium bifermentans
ACHIM_26-1  PASS    Prevotella phocaeensis
ACHIM_26-2  PASS    Prevotella phocaeensis
ACHIM_26-3  PASS    Prevotella phocaeensis
ACHIM_26-4  PASS    Prevotella phocaeensis
ACHIM_26-5  PASS    Prevotella phocaeensis
ACHIM_27-1  PASS    Eubacterium limosum
ACHIM_27-2  FAIL    Eubacterium limosum, Paraclostridium bifermentans
ACHIM_27-3  PASS    Eubacterium limosum
ACHIM_27-4  PASS    Eubacterium limosum
ACHIM_27-5  PASS    Eubacterium limosum
ACHIM_33-1  PASS    Prevotella phocaeensis
ACHIM_33-2  PASS    Prevotella phocaeensis
ACHIM_33-3  PASS    Prevotella phocaeensis
ACHIM_33-5  PASS    Prevotella phocaeensis
ACHIM_35-1  PASS    Prevotella phocaeensis
ACHIM_35-2  PASS    Prevotella phocaeensis
ACHIM_35-3  PASS    Prevotella phocaeensis
ACHIM_35-4  PASS    Prevotella phocaeensis
ACHIM_35-5  PASS    Prevotella phocaeensis
ACHIM_38-1  PASS    Prevotella phocaeensis
ACHIM_38-2  PASS    Prevotella phocaeensis
ACHIM_38-3  FAIL    Prevotella phocaeensis, Bacteroides uniformis
ACHIM_38-4  PASS    Prevotella phocaeensis
ACHIM_38-5  PASS    Prevotella phocaeensis
ACHIM_40-1  PASS    Prevotella phocaeensis
ACHIM_40-2  PASS    Prevotella phocaeensis
ACHIM_40-3  PASS    Prevotella phocaeensis
ACHIM_40-4  PASS    Prevotella phocaeensis
ACHIM_40-5  PASS    Prevotella phocaeensis
ACHIM_44-1  FAIL    Paraclostridium bifermentans, Clostridium perfringens
ACHIM_44-2  FAIL    Clostridium perfringens, Paraclostridium bifermentans
ACHIM_44-3  PASS    Paraclostridium bifermentans
ACHIM_44-5  PASS    Clostridium perfringens
ACHIM_47-1  PASS    Paraclostridium bifermentans
ACHIM_47-2  PASS    Paraclostridium bifermentans
CHIM_47-3   PASS    Clostridium perfringens
ACHIM_47-4  PASS    Clostridium perfringens
ACHIM_47-5  FAIL    Clostridium perfringens, Paraclostridium bifermentans
ACHIM_48-1  PASS    Paraclostridium bifermentans
ACHIM_48-2  PASS    Paraclostridium bifermentans
ACHIM_48-3  PASS    Paraclostridium bifermentans
ACHIM_48-4  PASS    Paraclostridium bifermentans
ACHIM_48-5  PASS    Paraclostridium bifermentans
ACHIM_49-1  PASS    Clostridium perfringens
ACHIM_49-2  PASS    Clostridium perfringens
ACHIM_51        PASS    Staphylococcus warneri
ACHIM_52        PASS    Paraclostridium bifermentans
ACHIM_53        FAIL    Bacteroides uniformis, Staphylococcus warneri
ACHIM_54        PASS    Paraclostridium bifermentans
ACHIM_55        PASS    Paraclostridium bifermentans
ACHIM_56        FAIL    Paraclostridium bifermentans, Bacteroides uniformis
ACHIM_57        PASS    Clostridium sp.
ACHIM_58        PASS    Paraclostridium bifermentans
ACHIM_59        PASS    Paraclostridium bifermentans
ACHIM_60        PASS    Clostridium perfringens
ACHIM_61        PASS    Clostridium perfringens
ACHIM_62        PASS    Clostridium sp.
ACHIM_63        PASS    Clostridium sporogenes

Nice!

I've pushed a new version of my assess_mash_screen.py script to classify the output from mash screen. I hope it works well, I haven't really tested it properly...

https://github.com/boulund/mash_scripts/commit/f4955bb3665788b2b7243aac58d169a8e6fed0ef

After our meeting today, I updated the assess_mash_screen.py script with a new output format that makes it easier to compare outputs, and I also changed the default classification score modifier to 0.20.

https://github.com/boulund/mash_scripts/commit/d166d36560d56b5b3e7614aa3959c1e81cf2abe8

I added results from the rerun of assess_mash_screen.py using the default classification score of 0.20.

@b16joski, Thanks for that. I would have liked to see some assessment of whether it improved the results or not? Please write a summary of your analysis as well. Also, it would be much nicer if you posted it as a new comment instead of editing the previous one--it makes it much easier to follow the series of events.

Classification of mash screen outputs using a classification score modifier of 0.20

From the previous classification of mash screen results using 0.15 as the classification score modifier (reduction), some of the expected mixed isolates turned out pure. So we decided to relax the classification score modifier to 0.20 and compare results.

method

The outputs from mash screen *.tsvfiles were used as inputs for the script assess_mash_screen.py Briefly the script was run as follows;

for filename in *.csv; do
    python ./assess_mash_screen.py -c 0.20 $filename
done

Output

ACHIM_22-1  FAIL    Eubacterium limosum, Paraclostridium bifermentans
ACHIM_22-2  FAIL    Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-3  FAIL    Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-4  FAIL    Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-5  FAIL    Eubacterium limosum, Paraclostridium bifermentans
ACHIM_23-1  FAIL    Bacteroides uniformis, Paraclostridium bifermentans
ACHIM_23-2  PASS    Bacteroides uniformis
ACHIM_23-3  PASS    Bacteroides uniformis
ACHIM_23-4  PASS    Bacteroides sp.
ACHIM_23-5  FAIL    Bacteroides sp., Paraclostridium bifermentans
ACHIM_24-1  PASS    Paraclostridium bifermentans
ACHIM_24-2  FAIL    Prevotella phocaeensis, Paraclostridium bifermentans
ACHIM_24-3  PASS    Paraclostridium bifermentans
ACHIM_24-4  PASS    Paraclostridium bifermentans
ACHIM_24-5  PASS    Paraclostridium bifermentans
ACHIM_26-1  PASS    Prevotella phocaeensis
ACHIM_26-2  PASS    Prevotella phocaeensis
ACHIM_26-3  PASS    Prevotella phocaeensis
ACHIM_26-4  PASS    Prevotella phocaeensis
ACHIM_26-5  PASS    Prevotella phocaeensis
ACHIM_27-1  PASS    Eubacterium limosum
ACHIM_27-2  FAIL    Eubacterium limosum, Paraclostridium bifermentans
ACHIM_27-3  PASS    Eubacterium limosum
ACHIM_27-4  PASS    Eubacterium limosum
ACHIM_27-5  PASS    Eubacterium limosum
ACHIM_33-1  PASS    Prevotella phocaeensis
ACHIM_33-2  PASS    Prevotella phocaeensis
ACHIM_33-3  PASS    Prevotella phocaeensis
ACHIM_33-5  PASS    Prevotella phocaeensis
ACHIM_35-1  PASS    Prevotella phocaeensis
ACHIM_35-2  PASS    Prevotella phocaeensis
ACHIM_35-3  PASS    Prevotella phocaeensis
ACHIM_35-4  PASS    Prevotella phocaeensis
ACHIM_35-5  PASS    Prevotella phocaeensis
ACHIM_38-1  PASS    Prevotella phocaeensis
ACHIM_38-2  PASS    Prevotella phocaeensis
ACHIM_38-3  FAIL    Prevotella phocaeensis, Bacteroides uniformis
ACHIM_38-4  PASS    Prevotella phocaeensis
ACHIM_38-5  PASS    Prevotella phocaeensis
ACHIM_40-1  PASS    Prevotella phocaeensis
ACHIM_40-2  PASS    Prevotella phocaeensis
ACHIM_40-3  PASS    Prevotella phocaeensis
ACHIM_40-4  PASS    Prevotella phocaeensis
ACHIM_40-5  PASS    Prevotella phocaeensis
ACHIM_44-1  FAIL    Paraclostridium bifermentans, Clostridium perfringens
ACHIM_44-2  FAIL    Clostridium perfringens, Paraclostridium bifermentans
ACHIM_44-3  PASS    Paraclostridium bifermentans
ACHIM_44-5  PASS    Clostridium perfringens
ACHIM_47-1  PASS    Paraclostridium bifermentans
ACHIM_47-2  PASS    Paraclostridium bifermentans
CHIM_47-3   PASS    Clostridium perfringens
ACHIM_47-4  PASS    Clostridium perfringens
ACHIM_47-5  FAIL    Clostridium perfringens, Paraclostridium bifermentans
ACHIM_48-1  PASS    Paraclostridium bifermentans
ACHIM_48-2  PASS    Paraclostridium bifermentans
ACHIM_48-3  PASS    Paraclostridium bifermentans
ACHIM_48-4  PASS    Paraclostridium bifermentans
ACHIM_48-5  PASS    Paraclostridium bifermentans
ACHIM_49-1  PASS    Clostridium perfringens
ACHIM_49-2  PASS    Clostridium perfringens
ACHIM_51        PASS    Staphylococcus warneri
ACHIM_52        PASS    Paraclostridium bifermentans
ACHIM_53        FAIL    Bacteroides uniformis, Staphylococcus warneri
ACHIM_54        PASS    Paraclostridium bifermentans
ACHIM_55        PASS    Paraclostridium bifermentans
ACHIM_56        FAIL    Paraclostridium bifermentans, Bacteroides uniformis
ACHIM_57        PASS    Clostridium sp.
ACHIM_58        PASS    Paraclostridium bifermentans
ACHIM_59        PASS    Paraclostridium bifermentans
ACHIM_60        PASS    Clostridium perfringens
ACHIM_61        PASS    Clostridium perfringens
ACHIM_62        PASS    Clostridium sp.
ACHIM_63        PASS    Clostridium sporogenes

Remarks

Originally samples (ACHIM 22-1, ACHIM 22-22, ACHIM 22-3, ACHIM 22-4, ACHIM 22-5, ACHIM 53 ACHIM 56) were manually classified as mix using SILVA database. While using the classification score of 0.15 some turned out pure. However, adjusting the classification score to 0.20, these came out as FAIL in the adjusted classification score of 0.20. Of interest is ACHIM 59, found to be a mix with SILVA and turning out pure with the classification score of 0.20. It would be nice to asses the percentage error rate here.

Great job! So it seems as if we have a detection rate of 72 out of 73 samples in this test, is that correct? I realize that the selection of samples probably isn't very varied in this case, but having a mixture detection rate accuracy of about 98% is actually quite good considering how basic the method we're using is... I'm entirely OK with continuing with everything as-is after seeing these results, but let's keep an eye out for how it performs on future samples. Looking forward to seeing the new code implemented soon 👍

Now I have pushed the new version of BACTpipev2.0 on github. I tested it out. I am looking at the results folders. I will give some comments. The code runs well without errors while simulating what we discussed theoretically in our previous meeting.

I have two suggestions

To implement multiqc on some of the process outputs as the last step
To have a validation and visual tool after generation of contigs.fa from the de novo assembly process. We can compare contigs.fa with either a customised database of highly curated references. This would help validate mash screen initial predication. Pyani is good alternative but needs inputs containing both query samples and a set of references in one directory to compute the percentage identities between the queries and the ref seqs. Hope I have brought out the idea well. What do you think?

This is very good. I think we can start to consider this issue resolved by now (and also #26 ). However, let's keep this issue open until we merge your local changes into the main repository.

I like it that you have more ideas for improvements @b16joski! Create a separate issue for each of your ideas so we can discuss them separately on their own as they are quite independent and not really connected to this issue either.

We just merged BACTpipeV2.0 into master.

ctmrbio / BACTpipe

Determine the best way to identify species from assembly #25

Test of Mash screen

Background

Method

Samples

Results

Conclusions

Test of mash screen with known genomes

Background

Method

Results

Conclusion

Test of Mashscreen with ACHIM genomes

Raw ACHIM reads to be screened with mash

code

Code execution

Results

remark

Results from using default classification score 0.20

Classification of mash screen outputs using a classification score modifier of 0.20

method

Output

Remarks