alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
89 stars 51 forks source link

customizing positional and GC biases #60

Open Congm12 opened 5 years ago

Congm12 commented 5 years ago

Hi,

I really enjoyed the positional and GC bias feature in polyester. However, I would like to incorporate more customized positional and GC bias models to make capture more potential bias scenarios in my simulation. I wonder whether there is way to do that?

I notice the rnaf_pos_bias.csv and cdna_pos_bias.csv for positional biases. I think generating files for my customized bias models in this format is a way towards incorporating customized biases. But I am not sure how to make polyester use the new ones instead of the default ones. As for GC bias, do such bias model files exist so that I can substitute to my own ones?

Best, Cong

warrenmcg commented 5 years ago

Hi @Congm12,

There is currently no way to do custom positional bias. What other scenarios would you consider besides cDNA or RNA fragmentation, since those are the most common protocols for fragmenting the library and the step that generates positional bias?

However, there is a way to add custom GC bias (fragment-level only), custom fragment length distribution, and custom platform error. See the Details section of ?polyester::simulate_experiment for details.

For fragment-level GC bias, you can supply your own matrix that is 101 rows (representing 0%, 1%, ... 100% GC content), and columns equal to the number of samples. At each entry in this matrix is the probability of a fragment with the given GC content showing up in the given sample. You would set the frag_GC_bias argument in simulate_experiment with this.

For custom fragment length distributions, you would need to provide a density fitted using logspline to custdens argument, and set distr to custom.

For custom platform error, you can run GenSIM as described in the vignette to get an error model from data that would be provided to the error_model argument. If you want to create an arbitrary one from scratch, you should run data("ill100v5_single", package = "polyester") in an R session and look at the expected format. Basically, you need 5 rows for each position (so 505 for a read length of 100, with the first set being position 0), and each row needs to describe the probability of observing each base given the reference/"true" base. Obviously, then, each row's probabilities show add up to 1 (or close to it, if you account for floating point errors).

There might be motivation to include the option for a custom positional bias if you gave more detail about what you were looking to do. Otherwise, I think the other customizable options should probably be sufficient.

Hope that answers your question!

Congm12 commented 5 years ago

Thanks for the answer!

I'm only considering cDNA and RNA, but I'm not sure whether there can be differences in the positional bias of different sequencing machines, sequencing centers. If the positional biases are pretty consistent among samples, then I won't need to customize this option.

warrenmcg commented 5 years ago

You're welcome!

Positional bias is a consequence of the library preparation protocol, specifically any fragmentation steps that are done (e.g. this paper) and a combination of factors related to the reverse transcription step (e.g. this paper). The largest variation is going to be between whether cDNA or RNA fragmentation was done; you may see differences based on the other parameters (what temperature, how long, etc), but you would have to look at biases that come out from just those parameters alone. Here was a paper that was trying to combine multiple sources of bias into one model.

You may see positional differences between centers, but again, it's dependent on the protocol, and I think it would be captured reasonably well under the cDNA/RNA models currently available. It is theoretically possible to make positional bias customizable, but it's unlikely that it will be incorporated quickly because the polyester team is on a "maintenance-only" development cycle right now. If you think that this is needed based on your observations or literature search, then feel free to report back.

The other kinds of biases that would be due to different sequencers would be more appropriate for the platform error and GC biases described above, which are already customizable.

Hope that helps!