bayesomicslab / ONT-nonb-GoFAE-DND

Deep statistical model for predicting non-B DNA structures from ONT sequencing

Example of Experimental Data to Use as Input For Model #5

Open alexturcoo opened 8 months ago

alexturcoo commented 8 months ago

Hello there,

I am trying to use this model to detect non-B structures from ONT data, but I am struggling to understand what the input should look like for experimental data. I am using ONT's most recent open-source basecaller, so my preprocessing workflow differs from the Albacore + Tombo workflow followed in your paper. I had no issues producing the simulated data, and after inspecting it, it made a lot of sense to me. Does the data produced by the simulator reflect how the input for experimental data should look in terms of features? Is there an example of input data produced from experimental samples? That would help me adapt my preprocessing to fit my workflow. Please let me know! Thanks

Marjan-Hosseini commented 8 months ago

Hi, the input for our model doesn't necessarily have to be preprocessed with Albacore and Tombo. We only need per-base translocation times for a region on a chromosome. You can use more recent pipelines such as dorado and f5c/nanopolish, or any other software that can produce per-base event times.
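
As a rough illustration of "translocation times per base for a region", here is a minimal sketch that sums event durations per reference position and then extracts a fixed-length window. It assumes a tab-separated table with the columns `read_name`, `contig`, `position`, and `event_length` (in seconds), loosely modeled on event-alignment output. The column names, the helper names, and the window length are assumptions for illustration, not the format this repository requires.

```python
import csv
import io
from collections import defaultdict


def per_base_dwell_times(tsv_text):
    """Sum event durations per (read, contig, position).

    Assumes tab-separated input with the (hypothetical) columns
    read_name, contig, position, event_length (seconds).
    Several events may map to one base, so durations are summed.
    """
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = (row["read_name"], row["contig"], int(row["position"]))
        totals[key] += float(row["event_length"])
    return dict(totals)


def window_vector(totals, read_name, contig, start, length):
    """Return per-base times for [start, start + length).

    Positions with no aligned event are returned as None so that
    gaps stay visible instead of being silently treated as zero.
    """
    return [totals.get((read_name, contig, start + i)) for i in range(length)]


# Tiny made-up example: two events on position 100, none on 102.
example = "\n".join([
    "read_name\tcontig\tposition\tevent_length",
    "read1\tchr1\t100\t0.002",
    "read1\tchr1\t100\t0.001",
    "read1\tchr1\t101\t0.004",
    "read1\tchr1\t103\t0.003",
])

dwell = per_base_dwell_times(example)
window = window_vector(dwell, "read1", "chr1", 100, 4)
print(window)
```

How missing positions should be handled (dropped, imputed, or the whole window discarded) is not specified in this thread, so the sketch only marks them.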

alexturcoo commented 8 months ago

@Marjan-Hosseini, in the simulated training data there are columns for each forward base, reverse base, and masked base. When I use experimental data, should the training data also include the bases for the negative strand? What if I only have positive-strand reads and the matching window on the negative strand is not present? Can I use a data frame with forward- and reverse-strand translocation times in different rows? What I mean is that, for a given read, a matching read is not always present on the other strand. Is it okay not to have the reverse-strand values in the same row as the forward ones? Thanks.

Marjan-Hosseini commented 6 months ago

Yes, the training data includes the signal from the reverse strand as well. If the reverse strand is not available, you may use the forward signal in its place, just to give the input the shape the model requires, but I would not recommend it, because the model probably wouldn't perform as expected. I'm curious to see the results if you do try it.
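
The fallback described above could be sketched roughly as follows. The row layout (`forward`, `reverse`, `reverse_is_fallback`) and the function name `build_model_row` are hypothetical and are not the column names used by the simulator or the model. The flag is there only so that windows built with the substituted signal can be evaluated separately, since performance on them may differ.

```python
def build_model_row(forward, reverse=None):
    """Combine forward and reverse per-base times into one row.

    If no reverse-strand window is available, the forward signal is
    copied into the reverse slot (the not-recommended fallback) and
    the row is flagged so it can be filtered or analyzed separately.
    The dictionary keys are hypothetical, chosen for illustration.
    """
    forward = list(forward)
    if reverse is None:
        reverse_part = list(forward)  # substitute forward signal
        is_fallback = True
    else:
        reverse_part = list(reverse)
        if len(reverse_part) != len(forward):
            raise ValueError("forward and reverse windows must have equal length")
        is_fallback = False
    return {
        "forward": forward,
        "reverse": reverse_part,
        "reverse_is_fallback": is_fallback,
    }


# Both strands available.
paired = build_model_row([0.003, 0.004, 0.002], [0.005, 0.001, 0.002])

# Only the forward strand available: reverse slot is a copy.
forward_only = build_model_row([0.003, 0.004, 0.002])
print(paired["reverse_is_fallback"], forward_only["reverse_is_fallback"])
```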