bcgsc / NanoSim

Nanopore sequence read simulator

Add option to control error percentage #7

Closed Psy-Fer closed 6 years ago

Psy-Fer commented 8 years ago

Hey there,

Would it be possible to have an additional option to be able to control the average error rate in the simulated reads?

Usage being that I would like to create a range of simulated reads with varying error profiles in order to test the error tolerances of various other tools. This tool would be fantastic for this because it uses actual ONT characteristics.

Maybe by editing the following?

    with open(model_prefix + "_unaligned_length_ecdf", 'r') as u_profile:
        new = u_profile.readline().strip()
        rate = new.split('\t')[1]
        # if parameter perfect is used, all reads should be aligned, number_aligned equals total number of reads.
        if per or rate == "100%":
            number_aligned = number
        else:
            number_aligned = int(round(number * float(rate) / (float(rate) + 1)))
        number_unaligned = number - number_aligned
        unaligned_dict = read_ecdf(u_profile)
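For reference, the split that snippet computes can be illustrated with made-up numbers (the `rate` value below is hypothetical, not from a real profile):

```python
# Illustrative arithmetic only: if the profile header reports an
# aligned:unaligned ratio ("rate"), reads are split as rate/(rate + 1).
number = 1000   # total reads to simulate (made up)
rate = 9.0      # hypothetical ratio read from the ECDF header

number_aligned = int(round(number * rate / (rate + 1)))
number_unaligned = number - number_aligned

print(number_aligned, number_unaligned)  # 900 100
```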

What do you think?

In the meantime, i'm just going to hack out some kind of solution, but it would be great if it was part of the tool.

Cheers

cheny19 commented 8 years ago

Sorry, I don't know how I missed your comment and just see it now.

We have also been trying to add an option to control the error rate, but several attempts have failed. The problem is that the error rate is determined by the length and position of each error, both of which are modelled empirically with a Markov model and statistical mixture models. We still don't have a clear relationship between the error rate and these factors.

As for the code you are pointing to, that controls the alignment rate, i.e. how many reads in a dataset can be aligned. It is not the error rate of the aligned reads.

Thanks for your advice.

Psy-Fer commented 8 years ago

Hello,

Ahh yes you are correct. My mistake :)

Would a close-approximation error model be useful, one that could be applied to the output to give more error than it currently does? I have just been introducing noise into the reads NanoSim creates and comparing the original read alignment against the new "noisy" read alignment to calculate the error.

Not all that great, but it kinda works. haha.

Regards,

cheny19 commented 8 years ago

Out of curiosity, how are you introducing noise into the output reads, and how do you determine whether it works?

Thanks!

Psy-Fer commented 8 years ago

Using numpy.random.choice (http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html), I switch a base out for another, or insert or delete one at a lower probability. I also use some list for-loops over poly runs of AAAA or TTTT and add or remove a letter or two.
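The approach described above could be sketched roughly like this. This is a guess at the idea, not the actual script; all probabilities are illustrative assumptions, and the homopolymer handling is omitted for brevity:

```python
import numpy as np

# Hedged sketch of per-base noise injection: substitute, insert, or
# delete each base with small (made-up) probabilities via random choice.
rng = np.random.default_rng(42)
BASES = list("ACGT")

def add_noise(read, p_sub=0.05, p_ins=0.02, p_del=0.02):
    out = []
    for base in read:
        op = rng.choice(["sub", "ins", "del", "keep"],
                        p=[p_sub, p_ins, p_del, 1 - p_sub - p_ins - p_del])
        if op == "sub":
            # replace with a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        elif op == "ins":
            # keep the base, then insert a random extra one
            out.append(base)
            out.append(rng.choice(BASES))
        elif op == "keep":
            out.append(base)
        # "del": drop the base entirely
    return "".join(out)

noisy = add_noise("ACGTACGTACGT" * 10)
```

The noisy read can then be realigned against the original to estimate the introduced error, as described above.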

Probably not all that representative of the "real" errors the MinION generates, and I'm not sure how to validate it, but I was looking at the point where some tools "break" depending on how messy the data gets.

I've kind of moved on to just using the event data and dynamic time warping for another project, but I will no doubt be using something like NanoSim in the future.