clb21565 closed this 4 years ago
Hi @clb21565, you are right, the ecoli model we provided is only compatible with old versions of NanoSim (< V.2.0). This model was trained with reads sequenced from R9.1 flowcells. We suggest you either use models trained with newer datasets (R9.4 or similar), as they are the current state of the art, or train your own model with the current version of NanoSim. The training should be simple and straightforward, but feel free to contact us if you have trouble training, and we can also do it for you.
ah, I see, thank you for clarifying!
thanks for the speedy reply!
It seems like the ONT consortium is now using NA12878 human as their go-to reference instead of E. coli, so I'm not sure about any "good ones". But I just did a quick search and found a relatively new dataset here; maybe you can check on that.
One last question for you: I am considering training my own model using a sequencing run from the Loman lab where they sequenced a zymo standard mock microbial community
altogether, they got about 16,506,755,978 bp on ~ 3 million reads - do you foresee any computational challenges to getting this size/scope of data through the training step?
it will be happening on a HPC cluster with up to 1 TB memory possible
We have done some tests with that dataset before, but we down-sampled it. In my opinion, 100k reads are enough to train a good model. But I also don't see any problem with using the whole dataset, except that the runtime may be longer because some steps are not parallelized yet.
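For anyone who wants to down-sample before training, here is a minimal sketch of drawing a uniform random subset of reads in a single pass (reservoir sampling). It assumes an uncompressed FASTQ file with exactly four lines per record; the function names, the default of 100,000 reads, and the seed are illustrative choices, not part of NanoSim.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield one FASTQ record at a time as a tuple of its four lines.

    Assumption: the file is uncompressed and every record spans exactly 4 lines.
    """
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:  # end of file (or a truncated trailing record)
            return
        yield record


def reservoir_sample(records, k, seed=0):
    """Return a uniform random sample of up to k records in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = rec      # replace with probability k / (i + 1)
    return sample


def subsample_fastq(in_path, out_path, k=100_000, seed=0):
    """Write a random subset of k reads from in_path to out_path."""
    with open(in_path) as fin:
        sample = reservoir_sample(read_fastq(fin), k, seed)
    with open(out_path, "w") as fout:
        for rec in sample:
            fout.writelines(rec)
    return len(sample)
```

Only the k sampled records are held in memory, so the size of the full run should not matter much for this step. The sampled reads are not written in their original order, which should not matter for training but is worth knowing.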
May I ask why you choose to use the mock community? Do you plan to simulate microbial community as well?
Good to know! I may sub-sample, too, then.
And - we are doing an in silico spike in experiment for a metagenomic assembly benchmarking study
basically- we have experimental shotgun metagenomes from wastewater w/ limited sequencing depth
our plan is to simulate Nanopore reads from a gammaproteobacteria that should not be there, then spike in the fake reads to the metagenomes to evaluate downstream assembly
I was thinking that the mock community was a good target for the error profile because we're working with metagenomics.
so, no we're not simulating a microbial community, rather simulating one organism based on the error profile of the community
I see, that's an interesting idea! Then I guess you can first subsample your dataset to only reads assigned to gammaproteobacteria or other organisms with similar GC content?
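As a rough illustration of the "similar GC content" idea, here is a small sketch that computes the GC fraction of a sequence and checks whether it falls within a window around a target value. The function names, the target, and the tolerance are hypothetical. Per-read GC content is noisy for error-prone long reads, so selecting reads by which reference genome they align to is probably the more reliable route.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T), case-insensitive."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def within_gc_window(seq, target, tolerance=0.05):
    """True if the sequence's GC fraction is within `tolerance` of `target`.

    `target` and `tolerance` are illustrative parameters; pick them from the
    GC content of the genome you intend to simulate.
    """
    return abs(gc_fraction(seq) - target) <= tolerance
```

Applying `within_gc_window` to the reference genomes of the community members, rather than to individual reads, would give a list of organisms whose reads could then be kept.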
Hmm, not sure about that - we will consider that! I think our plan was to leave the dataset as it was, then try to assemble de novo, then see if what comes out is what's expected.
One side note: we downloaded the Loman lab's sequencing of the zymo log community. Is it okay just to use the reference genomes of the community members as the reference genome for training?
We used all reference genomes for training, to avoid reads being forced to map to other species.
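If the community members' genomes come as separate FASTA files, one way to build a single combined reference is to concatenate them. The sketch below does that and guards against a file that lacks a trailing newline, which would otherwise glue the next header onto a sequence line. The function name and the paths in the usage line are placeholders, and it assumes sequence names are already unique across files.

```python
def combine_references(fasta_paths, out_path):
    """Concatenate FASTA files into one; return the number of sequences written.

    Assumption: sequence names are unique across the input files.
    """
    n_seqs = 0
    with open(out_path, "w") as out:
        for path in fasta_paths:
            last = "\n"
            with open(path) as fin:
                for line in fin:
                    if line.startswith(">"):
                        n_seqs += 1
                    out.write(line)
                    last = line
            # Make sure the next file's first header starts on its own line.
            if not last.endswith("\n"):
                out.write("\n")
    return n_seqs
```

Usage would look like `combine_references(["genome_a.fasta", "genome_b.fasta"], "community.fasta")`, with the file names standing in for the actual genome files.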
Hi all, thanks for the hard work. I noticed in the ftp zip files for the E. coli genomes, there are only these files:
-rw-r--r-- 1 clb21565 cs6824_f19  69616 Feb 24 14:48 ecoli_aligned_length_ecdf
-rw-r--r-- 1 clb21565 cs6824_f19  71424 Feb 24 14:48 ecoli_aligned_reads_ecdf
-rw-r--r-- 1 clb21565 cs6824_f19 287373 Feb 24 14:48 ecoli_align_ratio
-rw-r--r-- 1 clb21565 cs6824_f19    325 Feb 24 14:48 ecoli_del.hist
-rw-r--r-- 1 clb21565 cs6824_f19    290 Feb 24 14:48 ecoli_error_markov_model
-rw-r--r-- 1 clb21565 cs6824_f19   3729 Feb 24 14:48 ecoli_first_match.hist
-rw-r--r-- 1 clb21565 cs6824_f19 324657 Feb 24 14:48 ecoli_ht_ratio
-rw-r--r-- 1 clb21565 cs6824_f19    289 Feb 24 14:48 ecoli_ins.hist
-rw-r--r-- 1 clb21565 cs6824_f19   2107 Feb 24 14:48 ecoli_match.hist
-rw-r--r-- 1 clb21565 cs6824_f19  43795 Feb 24 14:48 ecoli_match_markov_model
-rw-r--r-- 1 clb21565 cs6824_f19    217 Feb 24 14:48 ecoli_mis.hist
-rw-r--r-- 1 clb21565 cs6824_f19    149 Feb 24 14:48 ecoli_model_profile
-rw-rw-r-- 1 clb21565 cs6824_f19 170833 Jan 21  2019 ecoli_R91D_profile.zip
-rw-r--r-- 1 clb21565 cs6824_f19   5591 Feb 24 14:48 ecoli_unaligned_length_ecdf
however, the simulation stage takes several files as input:
am I missing something here?
Thanks!