missing files in pre-trained e. coli models?

clb21565 commented 4 years ago

Hi all, thanks for the hard work. I noticed in the ftp zip files for the E. coli genomes, there are only these files:

-rw-r--r-- 1 clb21565 cs6824_f19 69616 Feb 24 14:48 ecoli_aligned_length_ecdf
-rw-r--r-- 1 clb21565 cs6824_f19 71424 Feb 24 14:48 ecoli_aligned_reads_ecdf
-rw-r--r-- 1 clb21565 cs6824_f19 287373 Feb 24 14:48 ecoli_align_ratio
-rw-r--r-- 1 clb21565 cs6824_f19 325 Feb 24 14:48 ecoli_del.hist
-rw-r--r-- 1 clb21565 cs6824_f19 290 Feb 24 14:48 ecoli_error_markov_model
-rw-r--r-- 1 clb21565 cs6824_f19 3729 Feb 24 14:48 ecoli_first_match.hist
-rw-r--r-- 1 clb21565 cs6824_f19 324657 Feb 24 14:48 ecoli_ht_ratio
-rw-r--r-- 1 clb21565 cs6824_f19 289 Feb 24 14:48 ecoli_ins.hist
-rw-r--r-- 1 clb21565 cs6824_f19 2107 Feb 24 14:48 ecoli_match.hist
-rw-r--r-- 1 clb21565 cs6824_f19 43795 Feb 24 14:48 ecoli_match_markov_model
-rw-r--r-- 1 clb21565 cs6824_f19 217 Feb 24 14:48 ecoli_mis.hist
-rw-r--r-- 1 clb21565 cs6824_f19 149 Feb 24 14:48 ecoli_model_profile
-rw-rw-r-- 1 clb21565 cs6824_f19 170833 Jan 21 2019 ecoli_R91D_profile.zip
-rw-r--r-- 1 clb21565 cs6824_f19 5591 Feb 24 14:48 ecoli_unaligned_length_ecdf

however, the simulation stage takes several files as input:

training_aligned_region.pkl: Kernel density function of aligned regions on aligned reads
training_aligned_reads.pkl: Kernel density function of aligned reads
training_ht_length.pkl: Kernel density function of unaligned regions on aligned reads
training_besthit.maf/sam: The best alignment of each read based on length
training_match.hist/training_mis.hist/training_del.hist/training_ins.hist: Histogram of match, mismatch, and indels
training_first_match.hist: Histogram of the first match length of each alignment
training_error_markov_model: Markov model of error types
training_ht_ratio.pkl: Kernel density function of head/(head + tail) on aligned reads
training.maf/sam: The alignment output
training_match_markov_model: Markov model of the length of matches (stretches of correct base calls)
training_model_profile: Fitted model for errors
training_processed.maf: A re-formatted MAF file for user-provided alignment file
training_unaligned_length.pkl: Kernel density function of unaligned reads
training_error_rate.tsv: Mismatch rate, insertion rate and deletion rate
training_strandness_rate: Strandness rate in input reads

am I missing something here?

Thanks!

cheny19 commented 4 years ago

Hi @clb21565, you are right, the ecoli model we provided is only compatible with old versions of NanoSim (< V.2.0). This model was trained with reads sequenced from R9.1 flowcells. We suggest you either use models trained with newer datasets (R9.4 or similar ones), as they are the current state of the art, or train your own model with the current version of NanoSim. The training should be simple and straightforward, but feel free to contact us if you have trouble training and we can also do it for you.
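
A quick way to see whether a downloaded model directory matches the newer layout is to compare its contents against the file names quoted earlier in this thread. The following is only a minimal sketch: the suffix list is copied from that quoted list (model files only, alignment outputs left out), and which of these files a given NanoSim version actually requires is an assumption, as are the function name and the `training` prefix.

```python
from pathlib import Path

# Suffixes taken from the file list quoted in this thread (model files only).
# Whether every one of them is required by a given NanoSim version is an assumption.
EXPECTED_SUFFIXES = [
    "_aligned_region.pkl",
    "_aligned_reads.pkl",
    "_ht_length.pkl",
    "_ht_ratio.pkl",
    "_unaligned_length.pkl",
    "_match.hist",
    "_mis.hist",
    "_del.hist",
    "_ins.hist",
    "_first_match.hist",
    "_error_markov_model",
    "_match_markov_model",
    "_model_profile",
    "_error_rate.tsv",
    "_strandness_rate",
]


def missing_model_files(model_dir, prefix):
    """Return the expected file names (prefix + suffix) not found in model_dir."""
    model_dir = Path(model_dir)
    return [
        prefix + suffix
        for suffix in EXPECTED_SUFFIXES
        if not (model_dir / (prefix + suffix)).is_file()
    ]
```

Calling `missing_model_files("ecoli_R91D_profile", "ecoli")` on the unpacked old model (the directory name here is hypothetical) would list the `.pkl` files, among others, since that model ships `_ecdf` files instead.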

clb21565 commented 4 years ago

ah, I see, thank you for clarifying!

I will try to train my own, but I am struggling to identify some good reference data. Specifically - it'd be nice to have reference / raw data for an E. coli genome on a 1D R9.4 run - know where I might find something like that?

thanks for the speedy reply!

cheny19 commented 4 years ago

It seems like the ONT consortium is now using the NA12878 human sample as their go-to reference instead of E. coli, so I'm not sure about any "good ones". But I just did a quick search and found a relatively new dataset here; maybe you can check on that.

clb21565 commented 4 years ago

One last question for you: I am considering training my own model using a sequencing run from the Loman lab where they sequenced a Zymo standard mock microbial community.

altogether, they got about 16,506,755,978 bp on ~ 3 million reads - do you foresee any computational challenges to getting this size/scope of data through the training step?

it will be happening on a HPC cluster with up to 1 TB memory possible

cheny19 commented 4 years ago

We have done some tests with that dataset before, but we down-sampled it. In my opinion, 100k reads are enough to train a good model. But I also don't see any problem with using the whole dataset, except that the runtime may be longer because some steps are not parallelized yet.

May I ask why you chose to use the mock community? Do you plan to simulate a microbial community as well?
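
For the down-sampling mentioned above, one option is reservoir sampling, which picks a fixed number of reads uniformly at random in a single pass without loading the whole file. This is a minimal sketch, not what was used for the tests described here: it assumes a plain 4-line-per-record FASTQ, and the function names are made up for illustration.

```python
import random


def read_fastq(handle):
    """Yield (header, seq, plus, qual) line tuples from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield header, seq, plus, qual


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random (all records if fewer than k)."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

With an open file handle `fh`, `reservoir_sample(read_fastq(fh), 100_000)` keeps only 100k records in memory at a time, and each sampled tuple can be written back out with `"".join(record)`.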

clb21565 commented 4 years ago

Good to know! I may sub-sample, too, then.

And - we are doing an in silico spike in experiment for a metagenomic assembly benchmarking study

basically- we have experimental shotgun metagenomes from wastewater w/ limited sequencing depth

our plan is to simulate Nanopore reads from a gammaproteobacteria that should not be there, then spike in the fake reads to the metagenomes to evaluate downstream assembly

I was thinking that the mock community was a good target for the error profile because we're working with metagenomics.

so, no we're not simulating a microbial community, rather simulating one organism based on the error profile of the community

cheny19 commented 4 years ago

I see, that's an interesting idea! Then I guess you can first subsample your dataset to only reads assigned to gammaproteobacteria or other organisms with similar GC content?
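
As a rough illustration of the GC-content idea, the sketch below computes the GC fraction of each reference genome and keeps those close to a target value; reads mapped to those genomes could then be used for training. The function names and the 0.05 tolerance are assumptions chosen for the example, not a recommendation.

```python
def gc_fraction(seq):
    """Fraction of G/C among the A, C, G, T bases of seq (ambiguous bases ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def similar_gc(genomes, target_gc, tolerance=0.05):
    """Return names of genomes whose GC fraction is within tolerance of target_gc.

    genomes maps a genome name to its sequence string.
    """
    return [
        name
        for name, seq in genomes.items()
        if abs(gc_fraction(seq) - target_gc) <= tolerance
    ]
```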

clb21565 commented 4 years ago

Hmm, not sure about that - we will consider that! I think our plan was to leave the dataset as it was, then try to assemble de novo, then see if what comes out is what's expected.

One side note: we downloaded the Loman lab's sequencing of the Zymo log community. Is it okay just to use the reference genomes of the community members as the reference genome for training?

cheny19 commented 4 years ago

We used all reference genomes for training, to avoid reads being forced to map to other species.
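
If the community reference genomes come as separate FASTA files, they can be combined into a single reference before alignment. The sketch below is one way to do that under stated assumptions: the function name is hypothetical, and prefixing each header with the source file name is just a convention to keep contig names unique, not something NanoSim requires.

```python
from pathlib import Path


def merge_fasta(paths, out_path):
    """Concatenate FASTA files into out_path, prefixing headers with the file stem.

    Returns the number of sequence records written.
    """
    n_records = 0
    with open(out_path, "w") as out:
        for path in map(Path, paths):
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # e.g. ">contig1" from ecoli.fasta becomes ">ecoli|contig1"
                        line = ">" + path.stem + "|" + line[1:]
                        n_records += 1
                    # Guard against a file that lacks a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_records
```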

bcgsc / NanoSim

missing files in pre-trained e. coli models? #77