Closed boulund closed 6 years ago
This is a suggestion I got from the microbioinfo Slack people @b16joski and @boulund :
Adam Phillippy (the mash guy) also writes: FYI, We are releasing an improved ANI tool in a few weeks called “FastANI” that is Mash-like but specifically designed for ANI and more accurate/sensitive. Preprint coming soon... Also fast, as the name implies ;)
I just had a quick look at the pyani GitHub repo, and I like the pyani suggestion. I think it would work well for our purposes. It doesn't have a lot of extra dependencies that are hard to install either. If the MUMmer-based approach (ANIm) works well in our case, I think that's the best route to aim for. Not entirely sure how well it will perform if we have a very large database of reference genomes to compare to... I guess we'll have to try and see!
It might still make sense to compare to specI as well. According to their paper (of course), specI has slightly better accuracy in assignments than ANI using NUCmer (the nucleotide alignment part of MUMmer; run as part of JSpecies), and much better runtime performance (i.e. it's much faster).
Gonna test pyani on some reference genomes this week to see how it compares with species
Ok, so I've done a bit of research here. This started out as me wanting to test Mash for screening (for issue #26, https://github.com/ctmrbio/BACTpipe/issues/26) but ended up a bit larger than that... Mash for screening seems to work fairly well, but I think we can also tweak it to solve the species identification issue that we've been discussing solving with pyani or something like that.
Yesterday, I made a really quick test of pyani against all of refseq using a single set of H. pylori contigs, just because I wanted to see how long it would take. It turns out it didn't manage to finish at all. It died sometime last night due to memory constraints. Huh, so I guess running pyani with nucmer against all of refseq is not an option.
There's a nice writeup of the Mash approach to k-mer screening in a blog post: https://genomeinformatics.github.io/mash-screen/ I suggest you have a look if you haven't seen it already.
This morning, I tested Mash screen a bit, intending to just make a quick test to see how it works.
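I used the screening method described on the Mash tutorial homepage (http://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes), with their refseq.genomes.k21s1000.msh sketch (https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh), which I downloaded to /db/refseq/ on CTMR-NAS.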
Then I ran
mash screen -w -p 8 /db/refseq/refseq.genomes.k21s1000.msh <path/to/sample_file.fastq.gz>
for my different sample files, described below.
I whipped together a small, quick'n'dirty Python script for assessing the output from mash screen, just to get a feel for whether it works at all for our use case, and how much extra tooling we need to develop in order to get something useful out of it. The code is here:
https://github.com/boulund/mash_scripts/blob/master/assess_mash_screen.py
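For anyone who has not looked at `mash screen` output before: each line is tab-separated with the columns identity, shared-hashes, median-multiplicity, p-value, query-ID and query-comment. Below is a minimal parsing sketch. It is not the code in `assess_mash_screen.py`; the names `ScreenHit` and `parse_mash_screen` are made up here for illustration, and the sketch assumes the comment column may be absent.

```python
from typing import Iterable, List, NamedTuple


class ScreenHit(NamedTuple):
    """One line of `mash screen` output (hypothetical container, not from the script)."""
    identity: float
    shared_hashes: int
    total_hashes: int
    median_multiplicity: float
    p_value: float
    query_id: str
    query_comment: str


def parse_mash_screen(lines: Iterable[str]) -> List[ScreenHit]:
    """Parse tab-separated `mash screen` lines into ScreenHit records."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t", 5)
        if len(fields) < 6:
            fields.append("")  # assumption: tolerate a missing comment column
        identity, hashes, multiplicity, p_value, query_id, comment = fields
        shared, total = hashes.split("/")  # e.g. "972/1000"
        hits.append(ScreenHit(float(identity), int(shared), int(total),
                              float(multiplicity), float(p_value),
                              query_id, comment))
    return hits


# Example line taken from the LE-Ecoli results further down in this thread.
example = ("0.998649\t972/1000\t62\t0\t"
           "GCF_000025745.1_ASM2574v1_genomic.fna.gz\t"
           "NC_017628.1 Escherichia coli IHE3034, complete genome")
hit = parse_mash_screen([example])[0]
print(hit.identity, hit.shared_hashes, hit.total_hashes)
```

Keeping shared and total hashes as separate integers makes it easy to experiment with different thresholds later on.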
I tried it with a Helicobacter pylori sample (G162-C, from an old project me and Kaisa worked on), both using the raw reads and the contigs.
I also tested with 171013_M03284_0074_000000000-BBMHR/Unaligned-Y301I8I8Y301/Project_KaisaWGS171004/Sample_171013-P280/171013-P280_ATTACTCG-AGGCGAAG_L001_R{1,2}*gz from the INBOX on CTMR-NAS.
When run on the G162-C contigs:
[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py screen.tab
The sample probably consist of only a single species: Helicobacter pylori
When run on the G162-C raw reads:
[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py reads_screen.tab
The sample probably consist of only a single species: Helicobacter pylori
When run on Sample_171013-P280:
[fredrik.boulund@ctmr-nas test]$ ~/code/mash_scripts/assess_mash_screen.py testwgs.tab
WARNING: The sample likely contains more than one species: Paraclostridium bifermentans, Candidatus Dorea, Clostridium sp., Paraclostridium benzoelyticum
0.990364 (816, 1000) 59 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]
0.988838 (790, 1000) 59 0.0 GCF_000452225.2_ASM45222v2_genomic.fna.gz [58 seqs] NZ_AVNB01000001.1 Paraclostridium bifermentans ATCC 19299 gcdATCC19299.contig.0, whole genome shotgun sequence [...]
0.97868 (636, 1000) 59 0.0 GCF_000764005.1_ASM76400v1_genomic.fna.gz [56 seqs] NZ_JQHY01000001.1 Clostridium sp. NCR contig00001, whole genome shotgun sequence [...]
0.977343 (618, 1000) 59 0.0 GCF_001282705.1_Dorea_massiliensis_genomic.fna.gz [36 seqs] NZ_LN876588.1 Candidatus Dorea massiliensis AP6 genome assembly Dorea massiliensis, scaffold scaffold00001, whole genome shotgun sequence [...]
0.970481 (533, 1000) 60 0.0 GCF_000452245.2_ASM45224v2_genomic.fna.gz [22 seqs] NZ_AVNC01000001.1 Paraclostridium bifermentans ATCC 638 gcdATCC638.contig.0, whole genome shotgun sequence [...]
0.961242 (436, 1000) 59 0.0 GCF_001006285.1_ASM100628v1_genomic.fna.gz [368 seqs] NZ_LBBT01000001.1 Paraclostridium benzoelyticum strain JC272 Contig1, whole genome shotgun sequence [...]
So, from this I conclude that it's probably feasible to implement mash screen as a quick way to screen for contaminated samples, but I'm not entirely sure how we should inform the user of this, and how to handle the different scenarios that can occur.
I would like to investigate further how accurate it is on a couple of samples that we've run before, so we can determine good cutoffs for the different parameters. Currently, the only features I use to assess whether a sample consists of multiple species are identity and shared_hashes. I set min_identity = 0.85 and min_shared_hashes_threshold to 10% of the number of shared_hashes of the best match (i.e. the one with the highest identity of all matches for this sample).
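The two cutoffs described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in `assess_mash_screen.py`: the function name `filter_hits`, the tuple layout and the use of `>=` for both comparisons are assumptions. The example values are taken from the H. pylori 26695 results further down in this thread.

```python
from typing import List, Tuple

# (identity, shared_hashes, label); the label is only there for display.
Hit = Tuple[float, int, str]


def filter_hits(hits: List[Hit], min_identity: float = 0.85,
                shared_fraction: float = 0.10) -> List[Hit]:
    """Keep hits that pass both the identity and the shared-hashes cutoff."""
    if not hits:
        return []
    # The best match is the hit with the highest identity.
    best = max(hits, key=lambda h: h[0])
    # Dynamic cutoff: a fraction of the best match's shared hashes.
    min_shared = shared_fraction * best[1]
    return [h for h in hits if h[0] >= min_identity and h[1] >= min_shared]


hits = [
    (1.0, 1000, "Helicobacter pylori 26695"),
    (0.88561, 78, "Helicobacter pylori HP87hu"),
    (0.873406, 38, "Malvastrum leaf curl Philippines betasatellite"),
    (0.858899, 41, "Helicobacter pylori H3016"),
    (0.856856, 39, "Candidatus Carsonella ruddii"),
]
kept = filter_hits(hits)
print([label for _, _, label in kept])
```

With these numbers the dynamic cutoff is 100 shared hashes, so only the top hit survives, even though all five hits pass the 0.85 identity cutoff on their own.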
I think we could use mash screen to get a usable indication of which species is in the sample, and thus which reference genome to use for genome annotation downstream, after assembly. I see two cases:
1. We only detect a single species in the sample using mash screen. Then mash screen serves both as an initial screen to detect potential contamination of samples and as a way to identify what species is in the sample. If we only detect one species, great! We can now send that information into a separate processing channel in Nextflow so that a suitable reference genome (and its annotation) can be downloaded and be ready for use when the genome assembly is completed (parallelization, yay!).
2. We detect multiple species in a sample using mash screen. Then we use the output from mash screen to download all the potential reference genomes that were identified (maybe limited to the top 10 best hits or something), and then run a separate pyani step after assembly, before genome annotation, to pick out the best match among the few genomes we compare against.
What do you think? Any comments?
Great work, really nice to see how things work out.
As for the identification of contaminated samples, I think the pipeline should just abort with an error message (or write an error log file) if more than one species is detected. And if the user absolutely wants to assemble the genomes anyway, this has to be done manually? Alternatively, there could be some kind of -force option that tells the pipeline to ignore that error. But that might be a bit overkill: to make any sense of the output, the contigs of the different genomes need to be classified/mapped and separated before annotation anyway, and there are a lot of potential complications with multi-species assemblies, so the usefulness of this is a bit unclear.
Do you think mash screen could replace pyani entirely, or should we still use both steps? The cases where we would like to download and use a reference genome are when there is a complete reference or representative genome for that species; at least that is what we used in Sandra's version. It needs to be complete to be useful in Mauve, and if it is not, I think its usefulness for annotation purposes is probably limited too...
Best wishes,
Kaisa
I think mash could replace the function we would like for
Nice that you had time to look at this @thorellk :)
I agree with you that it's a bit strange to continue the workflow if the screening step detects more than one species in the sample. However, it could be nice to have it continue, e.g. just until the assembly step, to make it easier for the user to continue working with the data manually from there. Another suitable step other than assembly could also be used as the stop in these cases (maybe just QC, trimming, etc.?).
We still need to test Mash a bit more on some samples whose contents we know, to at least get a better sense of how it performs for the most common organisms. The example I tried above, with Sample_171013-P280, suggested that the sample might contain multiple species, but I'm not entirely convinced by the output. As far as I can tell, the suggested "different" species could very well be the same species. Do you remember what it contained? Using mash screen is probably not going to be super robust in these situations, so we need to get a better feel for what thresholds to use.
If we manage to test Mash on a larger set of samples, and it performs well, I definitely think it can replace our plans for a separate pyani step after assembly. It's much cleaner than using pyani, with fewer dependencies and a smaller reference database. I would really like it if this worked out :)
According to SILVA, that sample (ACHIM-55) actually only contains rRNA from one species: Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Paraclostridium;
jspecies TCS says Clostridium sp. NCR (Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae) but it always only gives one hit so it wouldn’t detect contamination anyway.
So it could be that mash in this case is overdoing it. It is also one of those bacterial taxa that is really messy (several Clostridia have been reclassified to Paraclostridium etc) so it is not one of the easiest to start with.
But that pinpoints the challenges with the species determination quite nicely on the other hand…
I agree with you that performing trimming and QC regardless might be good, but that the pipeline should not proceed beyond that point if the sample is suspected to be contaminated.
BW,
Kaisa
Raw fastq.gz files from different sample types, including H. pylori 26695, two Kalix strains, ACHIM samples and E. coli, were screened with mash screen in order to rapidly estimate their resemblance to the reference genomes in the Mash RefSeq database.
I downloaded the Mash RefSeq database refseq.genomes.k21s1000.msh
/home/josephk/joseph/nextflow_bactpipe/mash_screen/refseqsketches.msh
Created a folder of test raw reads
/home/josephk/joseph/nextflow_bactpipe/mash_screen/test_genomes
Created a simple bash script with mash screen command
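#!/usr/bin/env bash
# Run mash screen on every read pair and classify the top hits.
seq='/home/josephk/joseph/nextflow_bactpipe/mash_screen/test_genomes/*_R1.fastq.gz'
ref='/home/josephk/joseph/nextflow_bactpipe/mash_screen/refseqsketches.msh/refseq.genomes*'
# $seq and $ref are left unquoted on purpose so that the shell expands the globs
echo $seq
for file1 in $seq
do
    echo 'starting mash screen'
    name=$(basename "$file1")
    echo "$name"
    # Derive the R2 file name from the R1 file name
    file2=$(echo "$file1" | sed 's/_R1/_R2/')
    mash screen -w -p 8 $ref "$file1" "$file2" > "$name.tab"
    # Show the five hits with the highest identity
    sort -gr "$name.tab" | head -5
    python ./assess_mash_screen.py "$name.tab" -o "$name.txt"
done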
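If this loop is ever moved into Python, the R1/R2 pairing and the command line could be sketched as below. The helper names `mate_path` and `mash_screen_command` and the example paths are hypothetical; the `-w` and `-p` flags are simply the ones used in the script above.

```python
from pathlib import Path
from typing import List


def mate_path(r1: Path) -> Path:
    """Return the R2 path for an R1 path by swapping the first `_R1` in the file name."""
    return r1.with_name(r1.name.replace("_R1", "_R2", 1))


def mash_screen_command(sketch: Path, r1: Path, threads: int = 8) -> List[str]:
    """Build the argument list for `mash screen -w -p <threads> <sketch> <R1> <R2>`."""
    return ["mash", "screen", "-w", "-p", str(threads),
            str(sketch), str(r1), str(mate_path(r1))]


# Hypothetical paths, only for demonstration.
r1 = Path("/data/test_genomes/ACHIM_51_R1.fastq.gz")
cmd = mash_screen_command(Path("/db/refseq/refseq.genomes.k21s1000.msh"), r1)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, stdout=handle, check=True)` with an open output file as `handle`, which avoids shell quoting problems with unusual file names.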
For H.pylori 26695
Writing output...
1 1000/1000 191 0 GCF_000008525.1_ASM852v1_genomic.fna.gz NC_000915.1 Helicobacter pylori 26695 chromosome, complete genome
0.88561 78/1000 1 3.40572e-243 GCF_000470135.1_HP87hu_genomic.fna.gz [72 seqs] NZ_CBRI010000001.1 Helicobacter pylori HP87hu WGS project CBRI000000000 data, contig 00001, whole genome shotgun sequence [...]
0.873406 38/652 7 2.10293e-114 GCF_000910735.1_ViralProj214366_genomic.fna.gz NC_021929.1 Malvastrum leaf curl Philippines betasatellite, complete sequence
0.858899 41/1000 2 6.9149e-117 GCF_000824885.1_H3016_genomic.fna.gz [101 seqs] NZ_CCMT01000002.1 Helicobacter pylori genome assembly H3016, contig H3016_001, whole genome shotgun sequence [...]
0.856856 39/1000 1 2.09924e-110 GCF_000287295.1_ASM28729v1_genomic.fna.gz NC_018417.1 Candidatus Carsonella ruddii HT isolate Thao2000, complete genome
For ACHIM_22-1
Writing output...
0.990652 821/1000 81 0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...]
0.985891 742/1000 24 0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...]
0.889242 85/1000 23 2.54906e-296 GCF_001481725.1_ASM148172v1_genomic.fna.gz NZ_CP011914.1 Eubacterium limosum strain SA11, complete genome
0.887203 81/1000 83 1.07846e-280 GCF_000452225.2_ASM45222v2_genomic.fna.gz [58 seqs] NZ_AVNB01000001.1 Paraclostridium bifermentans ATCC 19299 gcdATCC19299.contig.0, whole genome shotgun sequence [...]
0.874487 39/652 3 9.64658e-131 GCF_000910735.1_ViralProj214366_genomic.fna.gz NC_021929.1 Malvastrum leaf curl Philippines betasatellite, complete sequence
For ACHIM_51
Writing output...
0.998941 978/1000 107 0 GCF_000712925.1_psaLyso211_genomic.fna.gz [49 seqs] NZ_JOPV01000001.1 Staphylococcus warneri Lyso 2 2011 psaLyso211.contig.0, whole genome shotgun sequence [...]
0.852771 23/652 2 6.06497e-80 GCF_000910735.1_ViralProj214366_genomic.fna.gz NC_021929.1 Malvastrum leaf curl Philippines betasatellite, complete sequence
0.834744 15/666 13 1.04014e-49 GCF_000922435.1_ViralProj259986_genomic.fna.gz NC_024777.1 Small begomovirus-associated satellite isolate Sa19-S1, complete sequence
0.831966 21/1000 4 2.45625e-68 GCF_000920515.2_ViralProj243492_genomic.fna.gz NC_023876.2 Andrographis yellow vein leaf curl betasatellite clone bt-2, complete sequence
0.828011 19/1000 4 3.83459e-61 GCF_001429995.1_ViralProj300248_genomic.fna.gz NC_028116.1 Ageratum conyzoides symptomless alphasatellite isolate WOK80, complete sequence
For H.pylori Kx-1090A
0.974793 585/1000 148 0 GCF_000498335.1_ASM49833v1_genomic.fna.gz NC_022911.1 Helicobacter pylori BM012S, complete genome
0.92688 203/1000 144 0 GCF_000824885.1_H3016_genomic.fna.gz [101 seqs] NZ_CCMT01000002.1 Helicobacter pylori genome assembly H3016, contig H3016_001, whole genome shotgun sequence [...]
0.891666 90/1000 135 0 GCF_000013245.1_ASM1324v1_genomic.fna.gz [2 seqs] NC_008086.1 Helicobacter pylori HPAG1, complete genome [...]
0.874615 60/1000 135 9.96005e-210 GCF_000299835.1_ASM29983v1_genomic.fna.gz [10 seqs] NZ_AMOS01000001.1 Helicobacter pylori R32b HPR32b.contig.0, whole genome shotgun sequence [...]
0.867054 50/1000 142 5.34247e-171 GCF_000824865.1_H3014_genomic.fna.gz [51 seqs] NZ_CCMU01000002.1 Helicobacter pylori genome assembly H3014, contig H3014_001, whole genome shotgun sequence [...]
For H.pylori Kx-96A
0.972684 559/1000 113 0 GCF_000299835.1_ASM29983v1_genomic.fna.gz [10 seqs] NZ_AMOS01000001.1 Helicobacter pylori R32b HPR32b.contig.0, whole genome shotgun sequence [...]
0.91934 171/1000 109 0 GCF_900087475.1_TC-1_1_genomic.fna.gz [78 seqs] NZ_FLOM01000001.1 Helicobacter pylori isolate TC-1_1, whole genome shotgun sequence [...]
0.911545 143/1000 154 0 GCF_000894175.1_ViralProj80923_genomic.fna.gz NC_016568.1 Helicobacter phage phiHP33, complete genome
0.880455 69/1000 109 2.82977e-253 GCF_000275125.1_ASM27512v1_genomic.fna.gz [16 seqs] NZ_AKOI01000001.1 Helicobacter pylori Hp H-28 HpH_28.contig.0_1, whole genome shotgun sequence [...]
0.875981 62/1000 108 8.72094e-225 GCF_000013245.1_ASM1324v1_genomic.fna.gz [2 seqs] NC_008086.1 Helicobacter pylori HPAG1, complete genome [...]
For LE-Ecoli
Writing output...
0.998649 972/1000 62 0 GCF_000025745.1_ASM2574v1_genomic.fna.gz NC_017628.1 Escherichia coli IHE3034, complete genome
0.989958 809/1000 58 0 GCF_000846325.1_ViralProj14414_genomic.fna.gz NC_001609.1 Enterobacteria phage P4, complete genome
0.964685 470/1000 61 0 GCF_000839125.1_ViralProj14162_genomic.fna.gz NC_003444.1 Enterobacteria phage SfV, complete genome
0.964095 464/1000 59 0 GCF_000836905.1_ViralProj14035_genomic.fna.gz NC_001895.1 Enterobacteria phage P2, complete genome
0.958543 411/1000 60 0 GCF_000930115.1_ViralProj271780_genomic.fna.gz NC_026014.1 Enterobacteria phage P88, complete genome
For the ACHIM strains, ACHIM_22-1 had initially been identified as a mixture by SILVA screening, and ACHIM_51 was a pure isolate.
From the above results, it looks like we could use mash screen to get an idea of what species we have in our query genomes, and also of what reference genome we can use when annotating genomes with Prokka.
N.B. Sorry I am going to go through the Markdown tutorial. I am just starting to use it. Pardon me please.
Nice job! 👍 Let's discuss this in more detail in our meeting in 10 minutes :).
This is really interesting! I would love to see what output my simple "classification script", assess_mash_screen.py, gives on the different output files you produced (mashscreengenomes.tab). If it gives accurate results, it means we can start implementing mash screen for both contaminant and species identification in BACTpipe :)
I just thought of a way to possibly improve the specificity of species identification. However, it might come with some caveats that I haven't figured out yet, so please consider possible errors in my reasoning. I'm just toying with some ideas here, so feel free to correct me if I make any wrong assumptions.
From Mash we get a percent identity, but also the proportion of k-mers that match (which is used internally in Mash to compute the percent identity). Combining these values in a clever way might improve our assessment of the species in a sample, e.g. maybe even something simple like percent_identity * (shared_kmers / total_kmers) combined with a double threshold system where we have a minimum threshold for any kind of species prediction, and a second threshold that is dynamically computed depending on the top ranking hit (similar to what we do now) to discard similar-but-not-similar-enough reference sequences. Trying this on Sample_171013-P280 would give the following results:
| percent_identity | shared_hashes | total_hashes | species | classification_score |
|---|---|---|---|---|
| 0.990364 | 816 | 1000 | bifermentans | 0.808137 |
| 0.988838 | 790 | 1000 | bifermentans | 0.781182 |
| 0.97868 | 636 | 1000 | clostridium | 0.62244 |
| 0.977434 | 618 | 1000 | candidatus dorea | 0.604054 |
| 0.970481 | 533 | 1000 | bifermentans | 0.517266 |
| 0.961242 | 436 | 1000 | benzoelyticum | 0.419102 |
If we have a minimum first threshold of classification at a score of, say, 0.75 (this is not a percent identity score any more), and then use a second dynamic threshold of at most 0.15 from the top hit, we get a pure classification of bifermentans for this sample.
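As a rough sketch of the scoring idea above (not the final implementation; the function names and the tuple layout are made up for illustration), using the values from the table:

```python
from typing import List, Set, Tuple

# (percent_identity, shared_hashes, total_hashes, species)
Hit = Tuple[float, int, int, str]


def classification_score(identity: float, shared: int, total: int) -> float:
    """percent_identity * (shared_kmers / total_kmers), as proposed above."""
    return identity * (shared / total)


def classify(hits: List[Hit], min_score: float = 0.75,
             max_delta: float = 0.15) -> Set[str]:
    """Return species passing the fixed cutoff and lying close enough to the top score."""
    scored = [(classification_score(i, s, t), species) for i, s, t, species in hits]
    if not scored:
        return set()
    top = max(score for score, _ in scored)
    return {species for score, species in scored
            if score >= min_score and top - score <= max_delta}


hits = [
    (0.990364, 816, 1000, "bifermentans"),
    (0.988838, 790, 1000, "bifermentans"),
    (0.97868, 636, 1000, "clostridium"),
    (0.977434, 618, 1000, "candidatus dorea"),
    (0.970481, 533, 1000, "bifermentans"),
    (0.961242, 436, 1000, "benzoelyticum"),
]
print(classify(hits))  # {'bifermentans'}
```

If even the top hit scores below `min_score`, the returned set is empty, which could be treated as "no species prediction" rather than as a contamination warning.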
I'm unsure of how to best make use of the multiplicity value (the third column in the mash screen output), maybe that can also be incorporated somehow.
I'll implement the most recent classification method based on classification_score as described in my previous post. Then @b16joski can rerun mash screen on all samples using the new classification method so we can see how it performs.
All ACHIM whole genomes sequenced so far were screened with mash to classify both mixtures and pure isolates. This should give a hint of what species we most likely have in the query samples.
Here is the directory of query ACHIM genomes to screen with mash
/home/joseph.kirangwa/ACHIM
ACHIM_24-1_R1.fastq.gz ACHIM_27-1_R2.fastq.gz ACHIM_35-3_R1.fastq.gz ACHIM_40-3_R2.fastq.gz ACHIM_47-5_R1.fastq.gz ACHIM_59_R2.fastq.gz
ACHIM_22-1_R1.fastq.gz ACHIM_24-1_R2.fastq.gz ACHIM_27-2_R1.fastq.gz ACHIM_35-3_R2.fastq.gz ACHIM_40-4_R1.fastq.gz ACHIM_47-5_R2.fastq.gz ACHIM_60_R1.fastq.gz
ACHIM_22-1_R2.fastq.gz ACHIM_24-2_R1.fastq.gz ACHIM_27-2_R2.fastq.gz ACHIM_35-4_R1.fastq.gz ACHIM_40-4_R2.fastq.gz ACHIM_48-1_R1.fastq.gz ACHIM_60_R2.fastq.gz
ACHIM_22-2_R1.fastq.gz ACHIM_24-2_R2.fastq.gz ACHIM_27-3_R1.fastq.gz ACHIM_35-4_R2.fastq.gz ACHIM_40-5_R1.fastq.gz ACHIM_48-1_R2.fastq.gz ACHIM_61_R1.fastq.gz
ACHIM_22-2_R2.fastq.gz ACHIM_24-3_R1.fastq.gz ACHIM_27-3_R2.fastq.gz ACHIM_35-5_R1.fastq.gz ACHIM_40-5_R2.fastq.gz ACHIM_51_R1.fastq.gz ACHIM_61_R2.fastq.gz
ACHIM_22-3_R1.fastq.gz ACHIM_24-3_R2.fastq.gz ACHIM_27-4_R1.fastq.gz ACHIM_35-5_R2.fastq.gz ACHIM_44-1_R1.fastq.gz ACHIM_51_R2.fastq.gz ACHIM_62_R1.fastq.gz
ACHIM_22-3_R2.fastq.gz ACHIM_24-4_R1.fastq.gz ACHIM_27-4_R2.fastq.gz ACHIM_38-1_R1.fastq.gz ACHIM_44-1_R2.fastq.gz ACHIM_52_R1.fastq.gz ACHIM_62_R2.fastq.gz
ACHIM_22-4_R1.fastq.gz ACHIM_24-4_R2.fastq.gz ACHIM_27-5_R1.fastq.gz ACHIM_38-1_R2.fastq.gz ACHIM_44-2_R1.fastq.gz ACHIM_52_R2.fastq.gz ACHIM_63_R1.fastq.gz
ACHIM_22-4_R2.fastq.gz ACHIM_24-5_R1.fastq.gz ACHIM_27-5_R2.fastq.gz ACHIM_38-2_R1.fastq.gz ACHIM_44-2_R2.fastq.gz ACHIM_53_R1.fastq.gz ACHIM_63_R2.fastq.gz
ACHIM_22-5_R1.fastq.gz ACHIM_24-5_R2.fastq.gz ACHIM_33-1_R1.fastq.gz ACHIM_38-2_R2.fastq.gz ACHIM_44-3_R1.fastq.gz ACHIM_53_R2.fastq.gz
ACHIM_22-5_R2.fastq.gz ACHIM_26-1_R1.fastq.gz ACHIM_33-1_R2.fastq.gz ACHIM_38-3_R1.fastq.gz ACHIM_44-3_R2.fastq.gz ACHIM_54_R1.fastq.gz
ACHIM_23-1_R1.fastq.gz ACHIM_26-1_R2.fastq.gz ACHIM_33-2_R1.fastq.gz ACHIM_38-3_R2.fastq.gz ACHIM_44-5_R1.fastq.gz ACHIM_54_R2.fastq.gz
ACHIM_23-1_R2.fastq.gz ACHIM_26-2_R1.fastq.gz ACHIM_33-2_R2.fastq.gz ACHIM_38-4_R1.fastq.gz ACHIM_44-5_R2.fastq.gz ACHIM_55_R1.fastq.gz
ACHIM_23-2_R1.fastq.gz ACHIM_26-2_R2.fastq.gz ACHIM_33-3_R1.fastq.gz ACHIM_38-4_R2.fastq.gz ACHIM_47-1_R1.fastq.gz ACHIM_55_R2.fastq.gz
ACHIM_23-2_R2.fastq.gz ACHIM_26-3_R1.fastq.gz ACHIM_33-3_R2.fastq.gz ACHIM_38-5_R1.fastq.gz ACHIM_47-1_R2.fastq.gz ACHIM_56_R1.fastq.gz
ACHIM_23-3_R1.fastq.gz ACHIM_26-3_R2.fastq.gz ACHIM_33-5_R1.fastq.gz ACHIM_38-5_R2.fastq.gz ACHIM_47-2_R1.fastq.gz ACHIM_56_R2.fastq.gz
ACHIM_23-3_R2.fastq.gz ACHIM_26-4_R1.fastq.gz ACHIM_33-5_R2.fastq.gz ACHIM_40-1_R1.fastq.gz ACHIM_47-2_R2.fastq.gz ACHIM_57_R1.fastq.gz
ACHIM_23-4_R1.fastq.gz ACHIM_26-4_R2.fastq.gz ACHIM_35-1_R1.fastq.gz ACHIM_40-1_R2.fastq.gz ACHIM_47-3_R1.fastq.gz ACHIM_57_R2.fastq.gz
ACHIM_23-4_R2.fastq.gz ACHIM_26-5_R1.fastq.gz ACHIM_35-1_R2.fastq.gz ACHIM_40-2_R1.fastq.gz ACHIM_47-3_R2.fastq.gz ACHIM_58_R1.fastq.gz
ACHIM_23-5_R1.fastq.gz ACHIM_26-5_R2.fastq.gz ACHIM_35-2_R1.fastq.gz ACHIM_40-2_R2.fastq.gz ACHIM_47-4_R1.fastq.gz ACHIM_58_R2.fastq.gz
ACHIM_23-5_R2.fastq.gz ACHIM_27-1_R1.fastq.gz ACHIM_35-2_R2.fastq.gz ACHIM_40-3_R1.fastq.gz ACHIM_47-4_R2.fastq.gz ACHIM_59_R1.fastq.gz
/home/joseph.kirangwa/ACHIM/mash.nextflow.nf
#!/usr/bin/env nextflow
// Definition of default parameters
params.reads = '*_R{1,2}.fastq.gz'
params.database = "/db/refseq/*.msh"
// Parse the input
ref_database = file( params.database )
// Validate input file
if( !ref_database.exists() ) exit 1, "Missing reference database file: ${ref_database}"
// Define input channel for paired-end reads
Channel
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .set { read_pairs }
// Define mash process
process mash {
    cpus 8
    tag { pair_id }
    publishDir "./mash.screen", mode: 'copy'
    input:
    set pair_id, file(reads) from read_pairs
    output:
    set pair_id, file("${pair_id}.reads.csv") into next_ch
    script:
    """
    mash screen -w -p ${task.cpus} ${ref_database[0]} ${reads[0]} ${reads[1]} > ${pair_id}.reads.csv
    """
}
// Sort and classify mash screen outputs using `assess_mash_screen.py`
process mashsort {
    validExitStatus 0,1,2
    tag { sample_id }
    publishDir "./mash.screen_sorted", mode: 'copy'
    input:
    set sample_id, file(achim) from next_ch
    output:
    file("${sample_id}.mashscreen.tab")
    script:
    """
    sort -gr ${achim} | head -5 > ${sample_id}.achim.sorted.tab
    python /home/joseph.kirangwa/mash-script/mash_scripts/assess_mash_screen.py ${sample_id}.achim.sorted.tab -o ${sample_id}.mashscreen.tab
    """
}
workflow.onComplete {
    println ( workflow.success ? "Done!" : "Oops .. something went wrong" )
}
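The `sort -gr ... | head -5` step in `mashsort` keeps the five lines with the highest identity before classification. A rough Python equivalent is sketched below, in case the selection is ever folded into the classification script; `top_hits` is a hypothetical helper, and the order of lines with identical identity values may differ from what `sort` produces.

```python
from typing import Iterable, List


def top_hits(lines: Iterable[str], n: int = 5) -> List[str]:
    """Return the n lines with the highest leading identity value."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # The identity is the first tab-separated field of each line.
    rows.sort(key=lambda row: float(row.split("\t", 1)[0]), reverse=True)
    return rows[:n]


# Shortened example rows based on the ACHIM_22-1 results above.
rows = [
    "0.889242\t85/1000\t23\t2.54906e-296\tGCF_001481725.1_ASM148172v1_genomic.fna.gz",
    "0.990652\t821/1000\t81\t0\tGCF_000498455.1_WYM_1.0_genomic.fna.gz",
    "0.985891\t742/1000\t24\t0\tGCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz",
]
best = top_hits(rows, n=2)
print([row.split("\t")[0] for row in best])
```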
(base) [joseph.kirangwa@ctmr-nas ACHIM]$ nextflow run mash.nextflow.nf
WARNING: Sample 'ACHIM_22-1' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.990652 (821, 1000) 81 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.8133252919999999
0.985891 (742, 1000) 24 0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...] 0.7315311219999999
Sample 'ACHIM_22-2' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_22-3' probably consist of only a single species: Paraclostridium bifermentans
WARNING: Sample 'ACHIM_22-4' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.990422 (817, 1000) 38 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.8091747739999999
0.985573 (737, 1000) 9 0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...] 0.726367301
WARNING: Sample 'ACHIM_22-5' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.990479 (818, 1000) 38 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.810211822
0.9857 (739, 1000) 10 0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...] 0.7284323
WARNING: Sample 'ACHIM_23-1' likely contains more than one species: Bacteroides uniformis, Paraclostridium bifermentans
0.994092 (883, 1000) 64 0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...] 0.877783236
0.988719 (788, 1000) 5 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.779110572
Sample 'ACHIM_23-2' probably consist of only a single species: Bacteroides uniformis
Sample 'ACHIM_23-3' probably consist of only a single species: Bacteroides uniformis
Sample 'ACHIM_23-4' probably consist of only a single species: Bacteroides sp.
WARNING: Sample 'ACHIM_23-5' likely contains more than one species: Paraclostridium bifermentans, Bacteroides sp.
0.986582 (753, 1000) 3 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.7428962459999999
0.985764 (740, 1000) 41 0.0 GCF_000218365.1_Bacteroides_sp_1_1_30_V1_genomic.fna.gz [73 seqs] NZ_GL945090.1 Bacteroides sp. 1_1_30 genomic scaffold supercont1.1, whole genome shotgun sequence [...] 0.72946536
Sample 'ACHIM_24-1' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_24-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_24-3' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_24-4' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_24-5' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_26-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_26-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_27-1' probably consist of only a single species: Eubacterium limosum
WARNING: Sample 'ACHIM_27-2' likely contains more than one species: Paraclostridium bifermentans, Eubacterium limosum
0.989016 (793, 1000) 6 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.7842896880000001
0.9857 (739, 1000) 19 0.0 GCF_900103155.1_IMG-taxon_2642422557_annotated_assembly_genomic.fna.gz [87 seqs] NZ_FNJF01000082.1 Eubacterium limosum strain 32_A2, whole genome shotgun sequence [...] 0.7284323
Sample 'ACHIM_27-3' probably consist of only a single species: Eubacterium limosum
Sample 'ACHIM_27-4' probably consist of only a single species: Eubacterium limosum
Sample 'ACHIM_27-5' probably consist of only a single species: Eubacterium limosum
Sample 'ACHIM_33-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_33-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_33-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_33-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_35-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_38-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_38-2' probably consist of only a single species: Prevotella phocaeensis
WARNING: Sample 'ACHIM_38-3' likely contains more than one species: Prevotella phocaeensis, Bacteroides uniformis
0.999135 (982, 1000) 21 0.0 GCF_900065875.1_PRJEB126981_genomic.fna.gz [12 seqs] NZ_LT160615.1 Prevotella phocaeensis strain SN19, whole genome shotgun sequence [...] 0.9811505699999999
0.994039 (882, 1000) 52 0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...] 0.876742398
Sample 'ACHIM_38-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_38-5' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-1' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-2' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-3' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-4' probably consist of only a single species: Prevotella phocaeensis
Sample 'ACHIM_40-5' probably consist of only a single species: Prevotella phocaeensis
WARNING: Sample 'ACHIM_44-1' likely contains more than one species: Clostridium perfringens, Paraclostridium bifermentans
0.992346 (851, 1000) 84 0.0 GCF_000512415.1_JJC_1.0_genomic.fna.gz [69 seqs] NZ_AWRZ01000001.1 Clostridium perfringens JJC contig1, whole genome shotgun sequence [...] 0.8444864459999999
0.989899 (808, 1000) 7 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.7998383920000001
WARNING: Sample 'ACHIM_44-2' likely contains more than one species: Paraclostridium bifermentans, Clostridium perfringens
0.992346 (851, 1000) 140 0.0 GCF_000512415.1_JJC_1.0_genomic.fna.gz [69 seqs] NZ_AWRZ01000001.1 Clostridium perfringens JJC contig1, whole genome shotgun sequence [...] 0.8444864459999999
0.990306 (815, 1000) 22 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.80709939
Sample 'ACHIM_44-3' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_44-5' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_47-1' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_47-2' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_47-3' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_47-4' probably consist of only a single species: Clostridium perfringens
WARNING: Sample 'ACHIM_47-5' likely contains more than one species: Clostridium perfringens, Paraclostridium bifermentans
0.992235 (849, 1000) 69 0.0 GCF_000512415.1_JJC_1.0_genomic.fna.gz [69 seqs] NZ_AWRZ01000001.1 Clostridium perfringens JJC contig1, whole genome shotgun sequence [...] 0.842407515
0.989782 (806, 1000) 6 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.7977642920000001
Sample 'ACHIM_48-1' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-2' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-3' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-4' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_48-5' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_49-1' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_49-2' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_51' probably consist of only a single species: Staphylococcus warneri
Sample 'ACHIM_52' probably consist of only a single species: Paraclostridium bifermentans
WARNING: Sample 'ACHIM_53' likely contains more than one species: Staphylococcus warneri, Bacteroides uniformis
0.994092 (883, 1000) 102 0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...] 0.877783236
0.989489 (801, 1000) 2 0.0 GCF_000712925.1_psaLyso211_genomic.fna.gz [49 seqs] NZ_JOPV01000001.1 Staphylococcus warneri Lyso 2 2011 psaLyso211.contig.0, whole genome shotgun sequence [...] 0.792580689
Sample 'ACHIM_54' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_55' probably consist of only a single species: Paraclostridium bifermentans
WARNING: Sample 'ACHIM_56' likely contains more than one species: Bacteroides uniformis, Paraclostridium bifermentans
0.993985 (881, 1000) 9 0.0 GCF_001406135.1_13470_2_66_genomic.fna.gz [20 seqs] NZ_CZAF01000001.1 Bacteroides uniformis strain 2789STDY5834847, whole genome shotgun sequence [...] 0.875700785
0.990306 (815, 1000) 51 0.0 GCF_000498455.1_WYM_1.0_genomic.fna.gz [180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, whole genome shotgun sequence [...] 0.80709939
Sample 'ACHIM_57' probably consist of only a single species: Clostridium sp.
Sample 'ACHIM_58' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_59' probably consist of only a single species: Paraclostridium bifermentans
Sample 'ACHIM_60' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_61' probably consist of only a single species: Clostridium perfringens
Sample 'ACHIM_62' probably consist of only a single species: Clostridium sp.
Sample 'ACHIM_63' probably consist of only a single species: Clostridium sporogenes
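A side note on the last number of each hit line in the log above: it appears to be the identity multiplied by the fraction of shared hashes (for example 0.990652 × 821/1000 ≈ 0.8133). The minimal Python sketch below parses one raw `mash screen` line (tab-separated: identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment) and reproduces that number. The names `MashHit`, `parse_mash_screen_line` and `hit_score` are made up for illustration and are not taken from assess_mash_screen.py; the example line is reconstructed from the first hit for ACHIM_22-1.

```python
from typing import NamedTuple


class MashHit(NamedTuple):
    """One row of `mash screen` output (hypothetical container)."""
    identity: float
    shared: int
    total: int
    median_multiplicity: int
    p_value: float
    query_id: str
    comment: str


def parse_mash_screen_line(line: str) -> MashHit:
    """Parse one tab-separated `mash screen` output line."""
    fields = line.rstrip("\n").split("\t", 5)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated fields, got {len(fields)}")
    shared, total = fields[1].split("/")  # e.g. "821/1000"
    comment = fields[5] if len(fields) > 5 else ""
    return MashHit(float(fields[0]), int(shared), int(total),
                   int(fields[2]), float(fields[3]), fields[4], comment)


def hit_score(hit: MashHit) -> float:
    """Assumed score: identity weighted by the fraction of shared hashes."""
    return hit.identity * hit.shared / hit.total


# Line reconstructed from the log above (first hit for ACHIM_22-1).
example = ("0.990652\t821/1000\t81\t0\tGCF_000498455.1_WYM_1.0_genomic.fna.gz\t"
           "[180 seqs] NZ_AVSU01000001.1 Paraclostridium bifermentans WYM contig1, "
           "whole genome shotgun sequence [...]")
hit = parse_mash_screen_line(example)
print(hit.query_id, hit_score(hit))
```

The printed score agrees with the logged value for that hit up to floating-point rounding.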
Looks like we have some false-positive "probably single species" calls, for example ACHIM_22-2, ACHIM_22-3 and ACHIM_59, when I compare with the previous SILVA classification. We could probably tune for better specificity.
ACHIM_22-1 FAIL Eubacterium limosum, Paraclostridium bifermentans
ACHIM_22-2 FAIL Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-3 FAIL Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-4 FAIL Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-5 FAIL Eubacterium limosum, Paraclostridium bifermentans
ACHIM_23-1 FAIL Bacteroides uniformis, Paraclostridium bifermentans
ACHIM_23-2 PASS Bacteroides uniformis
ACHIM_23-3 PASS Bacteroides uniformis
ACHIM_23-4 PASS Bacteroides sp.
ACHIM_23-5 FAIL Bacteroides sp., Paraclostridium bifermentans
ACHIM_24-1 PASS Paraclostridium bifermentans
ACHIM_24-2 FAIL Prevotella phocaeensis, Paraclostridium bifermentans
ACHIM_24-3 PASS Paraclostridium bifermentans
ACHIM_24-4 PASS Paraclostridium bifermentans
ACHIM_24-5 PASS Paraclostridium bifermentans
ACHIM_26-1 PASS Prevotella phocaeensis
ACHIM_26-2 PASS Prevotella phocaeensis
ACHIM_26-3 PASS Prevotella phocaeensis
ACHIM_26-4 PASS Prevotella phocaeensis
ACHIM_26-5 PASS Prevotella phocaeensis
ACHIM_27-1 PASS Eubacterium limosum
ACHIM_27-2 FAIL Eubacterium limosum, Paraclostridium bifermentans
ACHIM_27-3 PASS Eubacterium limosum
ACHIM_27-4 PASS Eubacterium limosum
ACHIM_27-5 PASS Eubacterium limosum
ACHIM_33-1 PASS Prevotella phocaeensis
ACHIM_33-2 PASS Prevotella phocaeensis
ACHIM_33-3 PASS Prevotella phocaeensis
ACHIM_33-5 PASS Prevotella phocaeensis
ACHIM_35-1 PASS Prevotella phocaeensis
ACHIM_35-2 PASS Prevotella phocaeensis
ACHIM_35-3 PASS Prevotella phocaeensis
ACHIM_35-4 PASS Prevotella phocaeensis
ACHIM_35-5 PASS Prevotella phocaeensis
ACHIM_38-1 PASS Prevotella phocaeensis
ACHIM_38-2 PASS Prevotella phocaeensis
ACHIM_38-3 FAIL Prevotella phocaeensis, Bacteroides uniformis
ACHIM_38-4 PASS Prevotella phocaeensis
ACHIM_38-5 PASS Prevotella phocaeensis
ACHIM_40-1 PASS Prevotella phocaeensis
ACHIM_40-2 PASS Prevotella phocaeensis
ACHIM_40-3 PASS Prevotella phocaeensis
ACHIM_40-4 PASS Prevotella phocaeensis
ACHIM_40-5 PASS Prevotella phocaeensis
ACHIM_44-1 FAIL Paraclostridium bifermentans, Clostridium perfringens
ACHIM_44-2 FAIL Clostridium perfringens, Paraclostridium bifermentans
ACHIM_44-3 PASS Paraclostridium bifermentans
ACHIM_44-5 PASS Clostridium perfringens
ACHIM_47-1 PASS Paraclostridium bifermentans
ACHIM_47-2 PASS Paraclostridium bifermentans
ACHIM_47-3 PASS Clostridium perfringens
ACHIM_47-4 PASS Clostridium perfringens
ACHIM_47-5 FAIL Clostridium perfringens, Paraclostridium bifermentans
ACHIM_48-1 PASS Paraclostridium bifermentans
ACHIM_48-2 PASS Paraclostridium bifermentans
ACHIM_48-3 PASS Paraclostridium bifermentans
ACHIM_48-4 PASS Paraclostridium bifermentans
ACHIM_48-5 PASS Paraclostridium bifermentans
ACHIM_49-1 PASS Clostridium perfringens
ACHIM_49-2 PASS Clostridium perfringens
ACHIM_51 PASS Staphylococcus warneri
ACHIM_52 PASS Paraclostridium bifermentans
ACHIM_53 FAIL Bacteroides uniformis, Staphylococcus warneri
ACHIM_54 PASS Paraclostridium bifermentans
ACHIM_55 PASS Paraclostridium bifermentans
ACHIM_56 FAIL Paraclostridium bifermentans, Bacteroides uniformis
ACHIM_57 PASS Clostridium sp.
ACHIM_58 PASS Paraclostridium bifermentans
ACHIM_59 PASS Paraclostridium bifermentans
ACHIM_60 PASS Clostridium perfringens
ACHIM_61 PASS Clostridium perfringens
ACHIM_62 PASS Clostridium sp.
ACHIM_63 PASS Clostridium sporogenes
Nice!
I've pushed a new version of my assess_mash_screen.py script to classify the output from mash screen. I hope it works well, I haven't really tested it properly...
https://github.com/boulund/mash_scripts/commit/f4955bb3665788b2b7243aac58d169a8e6fed0ef
After our meeting today, I updated the assess_mash_screen.py script with a new output format that makes it easier to compare outputs, and I also changed the default classification score modifier to 0.20.
https://github.com/boulund/mash_scripts/commit/d166d36560d56b5b3e7614aa3959c1e81cf2abe8
I added results from the rerun of assess_mash_screen.py using the default classification score of 0.20.
@b16joski, thanks for that. I would have liked to see some assessment of whether it improved the results or not. Please write a summary of your analysis as well. Also, it would be much nicer if you posted it as a new comment instead of editing the previous one; it makes it much easier to follow the series of events.
In the previous classification of the mash screen results, which used 0.15 as the classification score modifier (reduction), some of the isolates we expected to be mixed turned out pure. So we decided to relax the classification score modifier to 0.20 and compare the results.
The *.csv output files from mash screen were used as inputs for the script assess_mash_screen.py.
Briefly, the script was run as follows:
for filename in *.csv; do
python ./assess_mash_screen.py -c 0.20 "$filename"
done
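To make the effect of the modifier concrete, here is a minimal Python sketch of one way such a cutoff could work. It is an assumption, not the actual logic of assess_mash_screen.py: it keeps every species whose best score lies within `modifier` of the top score (an absolute difference; the real script may well use a different rule) and reports FAIL when more than one species survives. The function name `classify_sample` and the species names and scores in the example are purely illustrative and are not taken from the ACHIM data.

```python
def classify_sample(scored_hits, modifier=0.20):
    """Hypothetical PASS/FAIL call from (species, score) pairs.

    Assumption: a species counts as present if its best score is at
    least (top score - modifier). This is NOT confirmed to be what
    assess_mash_screen.py does.
    """
    if not scored_hits:
        raise ValueError("no hits to classify")

    # Keep only the best score seen for each species.
    best = {}
    for species, score in scored_hits:
        best[species] = max(score, best.get(species, 0.0))

    cutoff = max(best.values()) - modifier
    kept = sorted((s for s in best if best[s] >= cutoff),
                  key=lambda s: best[s], reverse=True)
    verdict = "PASS" if len(kept) == 1 else "FAIL"
    return verdict, kept


# Illustrative values only (not from the thread).
hits = [("Species A", 0.80), ("Species B", 0.63)]
print(classify_sample(hits, modifier=0.15))
print(classify_sample(hits, modifier=0.20))
```

Under this assumed rule, a larger modifier lowers the cutoff, so a second species that was just excluded at 0.15 can be included at 0.20 and flip the sample from PASS to FAIL.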
ACHIM_22-1 FAIL Eubacterium limosum, Paraclostridium bifermentans
ACHIM_22-2 FAIL Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-3 FAIL Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-4 FAIL Paraclostridium bifermentans, Eubacterium limosum
ACHIM_22-5 FAIL Eubacterium limosum, Paraclostridium bifermentans
ACHIM_23-1 FAIL Bacteroides uniformis, Paraclostridium bifermentans
ACHIM_23-2 PASS Bacteroides uniformis
ACHIM_23-3 PASS Bacteroides uniformis
ACHIM_23-4 PASS Bacteroides sp.
ACHIM_23-5 FAIL Bacteroides sp., Paraclostridium bifermentans
ACHIM_24-1 PASS Paraclostridium bifermentans
ACHIM_24-2 FAIL Prevotella phocaeensis, Paraclostridium bifermentans
ACHIM_24-3 PASS Paraclostridium bifermentans
ACHIM_24-4 PASS Paraclostridium bifermentans
ACHIM_24-5 PASS Paraclostridium bifermentans
ACHIM_26-1 PASS Prevotella phocaeensis
ACHIM_26-2 PASS Prevotella phocaeensis
ACHIM_26-3 PASS Prevotella phocaeensis
ACHIM_26-4 PASS Prevotella phocaeensis
ACHIM_26-5 PASS Prevotella phocaeensis
ACHIM_27-1 PASS Eubacterium limosum
ACHIM_27-2 FAIL Eubacterium limosum, Paraclostridium bifermentans
ACHIM_27-3 PASS Eubacterium limosum
ACHIM_27-4 PASS Eubacterium limosum
ACHIM_27-5 PASS Eubacterium limosum
ACHIM_33-1 PASS Prevotella phocaeensis
ACHIM_33-2 PASS Prevotella phocaeensis
ACHIM_33-3 PASS Prevotella phocaeensis
ACHIM_33-5 PASS Prevotella phocaeensis
ACHIM_35-1 PASS Prevotella phocaeensis
ACHIM_35-2 PASS Prevotella phocaeensis
ACHIM_35-3 PASS Prevotella phocaeensis
ACHIM_35-4 PASS Prevotella phocaeensis
ACHIM_35-5 PASS Prevotella phocaeensis
ACHIM_38-1 PASS Prevotella phocaeensis
ACHIM_38-2 PASS Prevotella phocaeensis
ACHIM_38-3 FAIL Prevotella phocaeensis, Bacteroides uniformis
ACHIM_38-4 PASS Prevotella phocaeensis
ACHIM_38-5 PASS Prevotella phocaeensis
ACHIM_40-1 PASS Prevotella phocaeensis
ACHIM_40-2 PASS Prevotella phocaeensis
ACHIM_40-3 PASS Prevotella phocaeensis
ACHIM_40-4 PASS Prevotella phocaeensis
ACHIM_40-5 PASS Prevotella phocaeensis
ACHIM_44-1 FAIL Paraclostridium bifermentans, Clostridium perfringens
ACHIM_44-2 FAIL Clostridium perfringens, Paraclostridium bifermentans
ACHIM_44-3 PASS Paraclostridium bifermentans
ACHIM_44-5 PASS Clostridium perfringens
ACHIM_47-1 PASS Paraclostridium bifermentans
ACHIM_47-2 PASS Paraclostridium bifermentans
ACHIM_47-3 PASS Clostridium perfringens
ACHIM_47-4 PASS Clostridium perfringens
ACHIM_47-5 FAIL Clostridium perfringens, Paraclostridium bifermentans
ACHIM_48-1 PASS Paraclostridium bifermentans
ACHIM_48-2 PASS Paraclostridium bifermentans
ACHIM_48-3 PASS Paraclostridium bifermentans
ACHIM_48-4 PASS Paraclostridium bifermentans
ACHIM_48-5 PASS Paraclostridium bifermentans
ACHIM_49-1 PASS Clostridium perfringens
ACHIM_49-2 PASS Clostridium perfringens
ACHIM_51 PASS Staphylococcus warneri
ACHIM_52 PASS Paraclostridium bifermentans
ACHIM_53 FAIL Bacteroides uniformis, Staphylococcus warneri
ACHIM_54 PASS Paraclostridium bifermentans
ACHIM_55 PASS Paraclostridium bifermentans
ACHIM_56 FAIL Paraclostridium bifermentans, Bacteroides uniformis
ACHIM_57 PASS Clostridium sp.
ACHIM_58 PASS Paraclostridium bifermentans
ACHIM_59 PASS Paraclostridium bifermentans
ACHIM_60 PASS Clostridium perfringens
ACHIM_61 PASS Clostridium perfringens
ACHIM_62 PASS Clostridium sp.
ACHIM_63 PASS Clostridium sporogenes
Originally, samples ACHIM 22-1, ACHIM 22-2, ACHIM 22-3, ACHIM 22-4, ACHIM 22-5, ACHIM 53 and ACHIM 56 were manually classified as mixed using the SILVA database. With a classification score modifier of 0.15, some of these turned out pure. After adjusting the modifier to 0.20, all of them came out as FAIL.
Of interest is ACHIM 59, which was found to be a mix with SILVA but turns out pure with the classification score modifier of 0.20. It would be nice to assess the percentage error rate here.
Great job! So it seems as if we have a detection rate of 72 out of 73 samples in this test, is that correct? I realize that the selection of samples probably isn't very varied in this case, but having a mixture detection rate accuracy of about 98% is actually quite good considering how basic the method we're using is... I'm entirely OK with continuing with everything as-is after seeing these results, but let's keep an eye out for how it performs on future samples. Looking forward to seeing the new code implemented soon 👍
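For reference, the agreement figure can be computed with a few lines of Python. This is only a sketch: the function name `agreement_rate` is made up, and the example covers just the eight samples that are explicitly named in this thread as mixed according to SILVA, together with their PASS/FAIL calls at 0.20 from the table above. The 72 out of 73 figure additionally assumes that every other sample agrees with the manual classification, which is not listed here.

```python
def agreement_rate(calls, reference):
    """Return (matches, compared, fraction) for samples present in both dicts."""
    shared = calls.keys() & reference.keys()
    if not shared:
        raise ValueError("no samples in common")
    matches = sum(calls[s] == reference[s] for s in shared)
    return matches, len(shared), matches / len(shared)


# Samples reported in this thread as mixed according to SILVA.
reference = {s: "mixed" for s in (
    "ACHIM_22-1", "ACHIM_22-2", "ACHIM_22-3", "ACHIM_22-4",
    "ACHIM_22-5", "ACHIM_53", "ACHIM_56", "ACHIM_59")}

# Calls at modifier 0.20 from the table above (FAIL = mixed, PASS = pure).
calls = dict(reference)
calls["ACHIM_59"] = "pure"

matches, compared, fraction = agreement_rate(calls, reference)
print(f"{matches}/{compared} of the known mixed samples detected ({fraction:.1%})")

# Overall figure quoted in the thread: 72 of 73 samples.
print(f"72/73 = {72 / 73:.1%}")
```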
Now I have pushed the new version of BACTpipev2.0 on github. I tested it out. I am looking at the results folders. I will give some comments. The code runs well without errors while simulating what we discussed theoretically in our previous meeting.
I have two suggestions:
contigs.fa from the de novo assembly process. We can compare contigs.fa against a customised database of highly curated references. This would help validate the initial mash screen prediction. Pyani is a good alternative, but it needs the query samples and a set of references together in one directory to compute the percentage identities between the queries and the reference sequences. I hope I have explained the idea well. What do you think?
This is very good. I think we can start to consider this issue resolved by now (and also #26). However, let's keep this issue open until we merge your local changes into the main repository.
I like it that you have more ideas for improvements @b16joski! Create a separate issue for each of your ideas so we can discuss them separately, as they are quite independent and not really connected to this issue.
We just merged BACTpipeV2.0 into master.
We need to determine the best way to identify a reference species based on the assembled sequences (contigs.fa). Current ideas have touched upon MASH, but also Jspecies, etc. Still no solution.