bcgsc / NanoSim

Nanopore sequence read simulator

Variation in relative abundances of simulated Zymo mock microbes #213

Closed ezherman closed 2 weeks ago

ezherman commented 2 months ago

Hi,

We've been trying to simulate Zymo mock communities using your "even" pre-trained model. We've consistently found that Staphylococcus aureus appears to be undersimulated, while Cryptococcus neoformans appears to be oversimulated. The former should have a relative abundance of approx. 12, while the latter should have a relative abundance of approx. 2. I've included instructions below to reproduce the problem. Could you please advise on what we may be doing wrong? Thanks in advance!

Instructions

  1. Clone the NanoSim repository onto your machine:

    git clone https://github.com/bcgsc/NanoSim.git

  2. Download the Zymo reference genomes using this link.

  3. Unzip the ZymoBIOMICS.STD.refseq.v2 archive into a new ref_metagenome directory.

  4. Extract the metagenome_ERR3152364_Even.tar.gz archive:

    tar -xzvf pre-trained_models/metagenome_ERR3152364_Even.tar.gz -C pre-trained_models/

  5. In the sample_config_file/metagenome_list_for_training and sample_config_file/metagenome_list_for_simulation files, set the reference genome path of Cryptococcus neoformans to ref_metagenome/ZymoBIOMICS.STD.refseq.v2/Genomes/Cryptococcus_neoformans_draft_genome.fasta.

  6. In the same files, set the reference genome path of Saccharomyces cerevisiae to ref_metagenome/ZymoBIOMICS.STD.refseq.v2/Genomes/Saccharomyces_cerevisiae_draft_genome.fasta.

  7. Create an environment:

    mamba create -n nanosim -c bioconda nanosim numpy=1.21.5

  8. After activating the environment, simulate reads:

    src/simulator.py metagenome -gl sample_config_file/metagenome_list_for_simulation -a sample_config_file/abundance_for_simulation_multi_sample.tsv -dl sample_config_file/dna_type_list.tsv -c pre-trained_models/metagenome_ERR3152364_Even/training -b guppy --fastq -t 8

  9. Quantify the species abundance in the simulated data:

    src/read_analysis.py quantify -e meta -i simulated_sample0_aligned_reads.fastq -gl sample_config_file/metagenome_list_for_training -o quantification -t 8

  10. View the resulting quantification file:

    head -11 quantification_quantification.tsv

This can show, for example:

    Species                   Abundance
    Bacillus-subtilis         12.016698792315037
    Cryptococcus-neoformans   7.420805443075414
    Enterococcus-faecalis     12.013381517088918
    Escherichia-coli          12.015332943820978
    Lactobacillus-fermentum   12.015202663107305
    Listeria-monocytogenes    12.011862465946725
    Pseudomonas-aeruginosa    12.023731204767564
    Saccharomyces-cerevisiae  2.39457544028796
    Salmonella-enterica       12.020778921780073
    Staphylococcus-aureus     6.067630607809992

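
To make the deviation easier to see, here is a minimal Python sketch that divides each observed abundance by the expected one. The expected values are an assumption: 12 for each bacterium and 2 for each of the two yeasts, in line with the "approx. 12" and "approx. 2" figures mentioned above. The function names are made up for illustration and are not part of NanoSim.

```python
# Relative abundances copied from the quantification output above.
EXAMPLE = """\
Species Abundance
Bacillus-subtilis 12.016698792315037
Cryptococcus-neoformans 7.420805443075414
Enterococcus-faecalis 12.013381517088918
Escherichia-coli 12.015332943820978
Lactobacillus-fermentum 12.015202663107305
Listeria-monocytogenes 12.011862465946725
Pseudomonas-aeruginosa 12.023731204767564
Saccharomyces-cerevisiae 2.39457544028796
Salmonella-enterica 12.020778921780073
Staphylococcus-aureus 6.067630607809992
"""

# ASSUMPTION: the two yeasts are expected at 2, every other species at 12.
YEASTS = {"Cryptococcus-neoformans", "Saccharomyces-cerevisiae"}


def expected_abundance(species):
    """Return the assumed expected relative abundance for a species."""
    return 2.0 if species in YEASTS else 12.0


def parse_quantification(text):
    """Parse 'Species Abundance' rows separated by tabs or spaces."""
    observed = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) == 2:
            observed[fields[0]] = float(fields[1])
    return observed


def fold_changes(observed):
    """Return observed / expected for every species."""
    return {sp: value / expected_abundance(sp) for sp, value in observed.items()}


if __name__ == "__main__":
    ratios = fold_changes(parse_quantification(EXAMPLE))
    for sp, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{sp:26s} observed/expected = {ratio:.2f}")
```

On the example output, this gives a ratio of about 0.51 for Staphylococcus-aureus and about 3.71 for Cryptococcus-neoformans, while the other bacteria stay very close to 1.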
SaberHQ commented 2 months ago

Hey @cheny19, I would appreciate it if you could take a look at this issue.

ezherman commented 3 weeks ago

Hi NanoSim team, would it be possible to receive an update on this? I have observed deviations from the expected abundances in subsequent simulations too (using different community compositions).

As a workaround, I can calculate the returned abundance from the organism names in the FASTQ headers; however, it would be helpful to understand what might be driving these deviations.
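
A minimal sketch of that workaround in Python is below. It rests on two assumptions that should be checked against the actual files: that each read header starts with the species name used in the genome list (e.g. `Bacillus-subtilis`), and that every FASTQ record spans exactly four lines. The function names and the tiny example records are made up for illustration. Since I'm not certain whether the requested abundance refers to reads or to bases, the sketch reports both.

```python
from collections import Counter


def species_of(header, species_names):
    """Return the species whose name prefixes the read header, or None.

    ASSUMPTION: headers look like '@<species-name><rest of read ID>'.
    """
    name = header.lstrip("@")
    for sp in species_names:
        if name.startswith(sp):
            return sp
    return None


def abundance_from_fastq(lines, species_names):
    """Return (percent by reads, percent by bases) per species.

    ASSUMPTION: four lines per FASTQ record (no wrapped sequences).
    """
    # Try longer names first so one name cannot shadow a longer one.
    names = sorted(species_names, key=len, reverse=True)
    reads, bases = Counter(), Counter()
    current = None
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:  # header line
            current = species_of(line, names)
        elif i % 4 == 1 and current is not None:  # sequence line
            reads[current] += 1
            bases[current] += len(line)

    def to_percent(counts):
        total = sum(counts.values())
        if total == 0:
            return {}
        return {sp: 100.0 * n / total for sp, n in counts.items()}

    return to_percent(reads), to_percent(bases)


# Made-up records, only to show the call; real headers may differ.
demo = [
    "@Bacillus-subtilis_read0", "ACGTACGT", "+", "IIIIIIII",
    "@Escherichia-coli_read1", "ACGT", "+", "IIII",
    "@Bacillus-subtilis_read2", "ACGT", "+", "IIII",
]
by_reads, by_bases = abundance_from_fastq(
    demo, ["Bacillus-subtilis", "Escherichia-coli"]
)
```

For a real run, an open file handle for simulated_sample0_aligned_reads.fastq can be passed in place of `demo`, together with the full list of species names.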

lcoombe commented 3 weeks ago

Hi @ezherman,

Thanks for following up. I believe that I have identified where in the code this error occurs - I'm working on a fix and will update you when it is merged into the master branch! I'm delaying our next release until we can integrate this - I hope to get that out next week.

ezherman commented 3 weeks ago

That's great to hear, thank you for working on this @lcoombe!

lcoombe commented 3 weeks ago

Quick update - I merged my fix into the master branch! There's a more detailed explanation of what I found and how I fixed it in the PR here: #232. Hopefully it will fix the issue for you too - on my end, the resulting abundances were much closer to the expected values. Of course, there is still variation in the simulated abundances, so they won't be exactly what was requested. I will update again when I do the next release!

lcoombe commented 2 weeks ago

This fix was included in release v3.2.2!

ezherman commented 2 weeks ago

Thank you @lcoombe! I'll give the latest version a try as soon as I can, hopefully later this month.