Estimate length of long homopolymers in DNA

Psy-Fer / SquiggleKit

SquiggleKit: A toolkit for manipulating nanopore signal data

MIT License


Estimate length of long homopolymers in DNA #63

Open dweemx opened 8 months ago

dweemx commented 8 months ago

Hello, is it possible to use SquiggleKit to estimate the length of homopolymers in each read?

I guess I would have to use the Segmenter tool together with SquigglePull? However, I'm not too sure which options to use. It would be much appreciated if you could help me with this.

Psy-Fer commented 8 months ago

Hello,

Yes, segmenter would probably be the tool to use. The length could be estimated from the sequencing speed and the sampling rate, although it would be a pretty rough estimate.
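To make that arithmetic concrete, here is a minimal sketch (not part of SquiggleKit). It assumes a segment is described by its start and stop sample indices, and that the sampling rate and translocation speed are known. The function name and the example values (5000 Hz, 400 bases/s) are assumptions for illustration; use the values reported for your own run.

```python
def estimate_homopolymer_length(start_sample, stop_sample,
                                sampling_rate_hz, bases_per_second):
    """Rough base-count estimate for a signal segment.

    Hypothetical helper, not a SquiggleKit function.
    length = (samples in segment / sampling rate) * translocation speed
    """
    if stop_sample <= start_sample:
        raise ValueError("stop_sample must be greater than start_sample")
    if sampling_rate_hz <= 0 or bases_per_second <= 0:
        raise ValueError("sampling rate and speed must be positive")
    n_samples = stop_sample - start_sample
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return n_samples * bases_per_second / sampling_rate_hz


# Assumed example values: a 1500-sample segment, 5000 Hz sampling,
# 400 bases/s translocation speed.
length = estimate_homopolymer_length(12000, 13500, 5000.0, 400.0)
print(f"Estimated length: {length:.0f} bases")
```

Translocation speed varies between reads and within a read, so the result is only as good as the assumed average speed.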

James

dweemx commented 8 months ago

Ok, thank you for the prompt response. I guess that would still be a better estimate than estimating the homopolymer length at the read level?

Is the tool (Segmenter) agnostic to the chemistry (e.g. Kit 14)?

For running Segmenter.py, I'm not familiar with the different parameters that can be set. I expect my long poly(A/T) stretches to be between 80 bp and 200 bp. Would you suggest adapting some of the default parameter values?

I read the documentation, but I'm not too sure how to set the following parameters for estimating homopolymers:

-k --stall | False
-j --stall_start | 300
-g --gap | False
-b --gap_dist

If I understand correctly, the output of Segmenter.py is a TSV file containing the sample window of each homopolymer, which I would then combine with the sequencing speed and sampling rate?
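A sketch of what that post-processing could look like, under loud assumptions: the three-column layout used here (read ID, start sample, stop sample) is hypothetical and is not confirmed to match what Segmenter.py writes, so check the real output before adapting it. The read IDs, the sampling rate, and the speed are also made-up example values.

```python
import csv
import io

# Assumed run parameters; replace with the values for your own run.
SAMPLING_RATE_HZ = 5000.0
BASES_PER_SECOND = 400.0


def segment_lengths(tsv_text, sampling_rate_hz=SAMPLING_RATE_HZ,
                    bases_per_second=BASES_PER_SECOND):
    """Return (read_id, start, stop, estimated_bases) for each TSV row.

    Assumes a HYPOTHETICAL layout: read_id <TAB> start <TAB> stop.
    """
    results = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        read_id, start, stop = row[0], int(row[1]), int(row[2])
        n_samples = stop - start
        estimated_bases = n_samples * bases_per_second / sampling_rate_hz
        results.append((read_id, start, stop, estimated_bases))
    return results


# Made-up example rows, for illustration only.
example = "read_a\t1000\t2500\nread_b\t400\t3400\n"
for read_id, start, stop, est in segment_lengths(example):
    status = "within" if 80 <= est <= 200 else "outside"
    print(f"{read_id}: samples {start}-{stop}, ~{est:.0f} bases "
          f"({status} the expected 80-200 bp range)")
```

Flagging estimates outside the expected 80-200 bp range is a cheap sanity check on both the segmentation and the assumed speed.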

Psy-Fer commented 8 months ago

Hey,

Ahh yes, something like that. Let me take a look and get back to you. It's been a few years since I last worked on it...

Also, are you going from fast5, slow5, or pod5 files?

James

dweemx commented 8 months ago

I have pod5 files (but I could also work with fast5 files after a simple conversion).

Thanks a lot for looking into this.

Psy-Fer commented 8 months ago

Hey, do you have an example read you could share with me that has one of these homopolymers in it?

It would help me a lot in working out what the expected bounds for detection should be. Originally, segmenter was designed around R9.4.1 cDNA (using our RAGE-seq 10X single cell data). So it's probably worth the time to let me run a read from regular DNA containing the kind of homopolymer you want to measure, so I can make sure everything is in order.

James