bioinfomaticsCSU / deepsignal

Detecting methylation using signal-level features from Nanopore sequencing reads
GNU General Public License v3.0

Genome sequence in feature extraction #52

Closed PanZiwei closed 3 years ago

PanZiwei commented 3 years ago

Hi, I have a question about the genome sequence you use for feature extraction, and I would really appreciate more information.

I noticed that in line 221 of your extract_features.py you concatenate the bases in the events after Tombo re-squiggle and use the result as the genome sequence. However, when I checked the nucleotide sequence in the /Analyses/Basecall_1D_000/BaseCalled_template/fastq group of a single-read fast5 file from your example fast5s.sample.tar.gz, I found that the fastq sequence differs from the bases in the Tombo events. Could you explain the relationship between the bases in the events after Tombo re-squiggle and the fastq sequence basecalled by Albacore, as you mentioned?

Also, two sets of basecall information are saved in your example fast5 files. You mentioned that the Basecall_1D_000 group is the Albacore result. What about the Basecall_1D_001 group?

Thank you so much for your help!

Best, Ziwei

PengNi commented 3 years ago

Hi Ziwei,

(1) The fastq sequence in Basecall_1D_00x is the sequence of the read as called by a basecaller.
(2) After tombo re-squiggle, the raw signals of the read are aligned to the genome reference. So the sequence from the tombo events is the region of the genome reference that the read is aligned to, not the basecalled sequence itself.
(3) I didn't find a Basecall_1D_001 group in my example fast5 files, so I can't say which basecall result it is. I can check if you provide a fast5 file that contains a Basecall_1D_001 group.

Best, Peng

PanZiwei commented 3 years ago

Hi Peng, thanks for the response. In your readme file you give this tombo resquiggle usage example:

tombo resquiggle fast5s.al GCF_000146045.2_R64_genomic.fna --processes 10 --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_000 --overwrite

So, to my understanding, re-squiggle uses the sequence information from Guppy (saved in the Basecall_1D_000 group). For the same read, I therefore expected the sequence from Guppy and the sequence in the events to map to the same region. Or should they not map to the same region of the genome, since re-squiggle corrects the signal and may influence the mapping?

PengNi commented 3 years ago

Hi Ziwei,

The sequence from the tombo event table is actually a region of the genome reference.

Resquiggle generally has two steps:
1. Use minimap2 to map the read to the genome reference.
2. Map the raw signals to the region of the genome that the read is aligned to.
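To make the difference concrete, here is a toy sketch in Python. It is not Tombo's code: the reference, the read, the alignment coordinates, and the helper names (reverse_complement, reference_region, mismatches) are all made up for illustration. It only shows why the event sequence, being cut from the reference, can differ from the basecalled fastq sequence wherever the basecaller made an error or the sample differs from the reference.

```python
# Toy illustration only; all sequences and coordinates below are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def reference_region(reference, start, end, strand):
    """Return the reference bases a read is aligned to.

    Coordinates are 0-based and end-exclusive. For a minus-strand
    alignment the region is reverse-complemented so it reads in the
    same direction as the read.
    """
    region = reference[start:end]
    return region if strand == "+" else reverse_complement(region)


def mismatches(a, b):
    """Return positions where two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "TTGACCGATCGGATTACAGG"
# Hypothetical basecalled read (what the fastq would hold), with one error.
basecalled_read = "CCGATCGCATTAC"

# Step 1 (assumed result): an aligner such as minimap2 reports where the
# read maps. Here the coordinates are simply hard-coded.
start, end, strand = 4, 17, "+"

# Step 2 works against this reference region, so the bases stored with the
# events come from the reference, not from the basecalled read.
event_sequence = reference_region(reference, start, end, strand)

print("fastq sequence:", basecalled_read)
print("event sequence:", event_sequence)
print("differing positions:", mismatches(basecalled_read, event_sequence))
```

In this toy case both sequences map to the same region, yet they differ at one position, because only the fastq sequence carries the basecalling error.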

PanZiwei commented 3 years ago

Hi Peng, Thank you so much for the explanation! It definitely answered my question.

Thanks again for your help!