bcgsc / NanoSim

Nanopore sequence read simulator

Options / suggestions for how to simulate nCATS data? #200

Open tfenne opened 7 months ago

tfenne commented 7 months ago

Hi - I'm trying to simulate data similar to that generated by the nCATS protocol.

Concretely, I would like to specify one or more small regions (on the order of 1-50bp) where all reads should start, rather than having start positions randomly distributed throughout the genome.

I don't see any options to constrain the read locations, so I'm thinking that what I'll have to do is:

i) Generate small FASTA files that start where I want reads to start and extend for 100-200kb
ii) Simulate a lot of reads from that file
iii) Filter the simulated reads to only those that start within the region I want
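Step (i) could be sketched roughly as below. This is only an illustration using the Python standard library: the helper names (`read_fasta`, `extract_window`, `write_fasta`) are hypothetical and not part of NanoSim, and coordinates are assumed to be 0-based.

```python
def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_window(seq, start, length):
    """Return the slice beginning at 0-based `start`, at most `length` bases long.

    The slice is clipped automatically if it runs past the end of `seq`.
    """
    return seq[start:start + length]


def write_fasta(handle, name, seq, width=60):
    """Write one FASTA record, wrapping the sequence at `width` columns."""
    handle.write(f">{name}\n")
    for i in range(0, len(seq), width):
        handle.write(seq[i:i + width] + "\n")
```

For example, one would load the genome with `dict(read_fasta(open("genome.fa")))`, call `extract_window(records["chr1"], start, 200_000)` for each target start, and write each window to its own file with `write_fasta`. The file name, contig name, and window length here are placeholders. Note that positions reported for reads simulated from such a window are relative to the window, not to the original genome.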

I'm guessing (ii) and (iii) will be rather slow, and I'm wondering if you have better suggestions for how to proceed? Thanks!

SaberHQ commented 6 months ago

Thank you @tfenne for using NanoSim.

NanoSim currently does not have such a feature. It would be interesting to explore adding it in a future release, but I cannot promise that we will work on it or give an approximate timeframe.

In the meantime, I would follow the approach you described: generate a large number of reads, then filter them by location. NanoSim is fairly fast at generating reads, and you should be able to generate millions of reads within a day.
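The filtering step might look something like the sketch below. It assumes that simulated read names follow the pattern `<reference>_<position>_<aligned|unaligned>_<index>_<F|R>_<head>_<middle>_<tail>`; please verify this against the README of the NanoSim version in use, since the format is an assumption here. The names `ReadInfo`, `parse_read_name`, and `starts_in_region` are hypothetical helpers, and keeping only aligned forward-strand reads is a simplification (how the position relates to the read start on the reverse strand should be checked before relying on it).

```python
from typing import NamedTuple, Optional


class ReadInfo(NamedTuple):
    ref: str      # reference (contig) name
    start: int    # position field taken from the read name
    status: str   # "aligned" or "unaligned"
    strand: str   # "F" or "R"


def parse_read_name(name: str) -> Optional[ReadInfo]:
    """Parse a read name in the ASSUMED NanoSim format; return None if it does not fit."""
    tokens = name.lstrip(">@").split()
    if not tokens:
        return None
    # Split from the right so underscores inside the reference name survive.
    parts = tokens[0].rsplit("_", 7)
    if len(parts) != 8 or not parts[1].isdigit():
        return None
    return ReadInfo(ref=parts[0], start=int(parts[1]), status=parts[2], strand=parts[4])


def starts_in_region(name: str, ref: str, lo: int, hi: int, strand: str = "F") -> bool:
    """True if the read is aligned, on `strand`, and its position lies in [lo, hi) on `ref`."""
    info = parse_read_name(name)
    return (
        info is not None
        and info.ref == ref
        and info.status == "aligned"
        and info.strand == strand
        and lo <= info.start < hi
    )
```

Because only the read name is inspected, this can be applied while streaming through the output FASTA or FASTQ without loading any sequences into memory.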

I will keep you updated on this. Thanks, Saber.