bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

Get position of targeted base in reads event table #17

Closed pterzian closed 4 years ago

pterzian commented 4 years ago

Hello PengNi,

I would like to investigate the signal stored in the event table of resquiggled reads. To that end, I thought starting from the extracted_features file would be my best shot. However, I am missing one piece of information: the position of the targeted base in the event table of each read. For example, this could be the start and end positions of the 17-mer in the event table.

Do you know how or where I could recover this information?

Could it be added to the extracted_features files?

Thanks a lot,

Paul

PengNi commented 4 years ago

Hi Paul @pterzian ,

(1) I'm not sure if I totally get what you mean. However, if you want to use the extract_features module, I think the easiest way is to change loc_in_ref to loc_in_read in extract_features.py L234. loc_in_read is the position of the targeted base in the aligned read after re-squiggle. After you change it, you can rebuild and install the package. Then the 4th column of the output file may be what you want.

(2) We only output at most 360 signals for a 17-mer by default. Also, the output signals are normalized.

(3) Maybe you can also check out the tombo package to see if it has what you want.

Best, Peng

pterzian commented 4 years ago

Thank you for the hints, I will try this loc_in_read solution. I guess what would be most useful to me is a new column in the event_table with the genomic position for each nucleotide and its resquiggled signal (I guess the information must be stored somewhere in some kind of CIGAR line). I'll see if I can do it from tombo's code, but I thought of starting from your function because I find it pretty useful.

best, Paul
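
For readers with the same goal, here is a minimal sketch of adding a genomic-position column to an event table. It rests on assumptions not confirmed in this thread: that the resquiggled event table holds exactly one row per aligned reference base, in read order, and that the alignment start, end (0-based, end-exclusive) and strand are known. The function name `annotate_ref_positions` and its parameters are hypothetical and not part of deepsignal or tombo.

```python
def annotate_ref_positions(events, align_start, align_end, strand):
    """Append a reference position to each (start, end, base) event row.

    Assumption: one event per aligned reference base, listed in read order,
    with align_start/align_end given as 0-based, end-exclusive coordinates.
    """
    if len(events) != align_end - align_start:
        raise ValueError("expected one event per aligned reference base")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    rows = []
    for i, (start, end, base) in enumerate(events):
        # On the minus strand the read runs against the reference,
        # so the first event maps to the last aligned position.
        ref_pos = align_start + i if strand == "+" else align_end - 1 - i
        rows.append((start, end, base, ref_pos))
    return rows


# Toy data: three bases with their raw-signal boundaries.
toy_events = [(0, 4, "A"), (4, 9, "C"), (9, 12, "G")]
plus_rows = annotate_ref_positions(toy_events, 100, 103, "+")
minus_rows = annotate_ref_positions(toy_events, 100, 103, "-")
```

If the one-row-per-reference-base assumption holds, no CIGAR string is needed, because the index of a row in the table is already an offset into the alignment.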

PengNi commented 4 years ago

@pterzian ,

fwiw, you can check out the _get_label_raw() function in extract_features.py L184. It returns the raw_signals array and the position information of each nucleotide in the format (start, end, base).
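
As an illustration of how the (start, end, base) tuples could answer the original question, the sketch below looks up the raw-signal span of a targeted base and of the 17-mer centred on it. It assumes the tuples are in read order and that loc_in_read is a 0-based index into that list; the helper name `kmer_signal_span` and the toy data are hypothetical, not taken from deepsignal.

```python
def kmer_signal_span(events, loc_in_read, kmer_len=17):
    """Return ((kmer_start, kmer_end), (base_start, base_end)).

    events      : list of (start, end, base) tuples in read order (assumed)
    loc_in_read : 0-based index of the targeted base in events (assumed)
    Returns None when the k-mer window does not fit inside the read.
    """
    half = kmer_len // 2
    first, last = loc_in_read - half, loc_in_read + half
    if first < 0 or last >= len(events):
        return None
    base_start, base_end, _ = events[loc_in_read]
    # The k-mer spans from the start of its first base to the end of its last.
    return (events[first][0], events[last][1]), (base_start, base_end)


# Toy read: 20 bases, each covering exactly 3 raw-signal samples.
toy_read = [(i * 3, i * 3 + 3, "ACGT"[i % 4]) for i in range(20)]
kmer_span, base_span = kmer_signal_span(toy_read, 10)
```

In this toy read the 17-mer around index 10 covers bases 2 to 18, so `kmer_span` is `(6, 57)` and `base_span` is `(30, 33)`. These are indices into the raw signal, which is not necessarily the same as the at-most-360 normalized signals written to the output file.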