cancerit / telomerecat

Telomerecat: The telomere computational analysis tool
GNU General Public License v3.0

Telomere length is zero #35

Closed kacyng92 closed 2 years ago

kacyng92 commented 3 years ago

Hi, I've been using Telomerecat on some WGS data, and for some samples the estimated length returned is 0. Comparing these with successful runs for other samples, I noticed that whenever the F2c value is negative, the output is 0. As I understand it, F2c is the difference between the F2 and F4 reads (F2 - F4), so a negative F2c means there is more non-telomeric boundary sequence (F4) than telomeric boundary sequence (F2). Is it possible that F1 is positive and F2 negative? And how should I troubleshoot this issue? Thank you very much! Your help is greatly appreciated. Regards,
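A minimal sketch of the check described above, useful for listing which samples are affected. It assumes F2c is simply F2 - F4, as stated in the post, which may not match Telomerecat's internal calculation. The record keys ("name", "F2", "F4") and the counts are hypothetical.

```python
def flag_negative_f2c(samples):
    """Return the names of samples whose F2 - F4 difference is negative.

    `samples` is a list of dicts with hypothetical keys
    "name", "F2" and "F4" (read counts).
    """
    flagged = []
    for s in samples:
        f2c = s["F2"] - s["F4"]  # assumed definition: F2c = F2 - F4
        if f2c < 0:
            flagged.append(s["name"])
    return flagged


# Made-up counts, for illustration only
example = [
    {"name": "sample_A", "F2": 1200, "F4": 900},
    {"name": "sample_B", "F2": 700, "F4": 950},
]
print(flag_negative_f2c(example))
```

Here only sample_B would be flagged, since its F4 count exceeds its F2 count.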

PedalheadPHX commented 3 years ago

Just found this tool while reading (Moore, L., Cagan, A., Coorens, T.H.H. et al. The mutational landscape of human somatic and germline cells. Nature (2021). https://doi.org/10.1038/s41586-021-03822-). The paper makes an odd comment, with no justification: "For samples that were sequenced using the NovaSeq sequencing platform, the results using Telomerecat were occasionally implausible (such as telomere length estimates of 0 bp)."

So is this NovaSeq data?

tibutler commented 3 years ago

The issue seems specific to the ends of NovaSeq reads. Using the -t 75 option seems to fix this to some degree: it uses only the first 75 bp of each read, thereby removing the read ends, which have an excess of Gs that seems to be the problem. Using the same option on X10 data slightly shortens the telomere length estimates, but it allows comparison between samples sequenced on different platforms.
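To illustrate why keeping only the first 75 bp helps, here is a rough sketch on a made-up 150 bp read. It is not Telomerecat's code: `trim_read` and `trailing_g_run` are hypothetical helpers, and the read is synthetic (telomeric repeats followed by an artificial run of Gs).

```python
def trim_read(seq, keep=75):
    """Keep only the first `keep` bases of a read."""
    return seq[:keep]


def trailing_g_run(seq):
    """Length of the run of Gs at the 3' end of a read."""
    return len(seq) - len(seq.rstrip("G"))


# Synthetic read: 90 bp of TTAGGG repeats, then 60 spurious Gs
read = "TTAGGG" * 15 + "G" * 60
trimmed = trim_read(read, keep=75)

print(len(read), trailing_g_run(read))
print(len(trimmed), trailing_g_run(trimmed))
```

The untrimmed read ends in a long G run, while the trimmed read stops inside the repeat region and no longer carries it.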

PedalheadPHX commented 3 years ago

I'd assume this correlates with the estimated insert size, as excess G usually indicates that a sequencing read has crossed the insert and the adaptor on the other side, so there is no DNA template left. That produces no signal, which the 2-color Illumina systems call as a G.

Otherwise maybe this is a NovaSeq specific error profile from the telomere repeat sequence.

tibutler commented 3 years ago

If you run fastqc on the resulting telbam files for X10 vs NovaSeq data, you can see a clear increase in the proportion of Gs after cycle 100 or so, but only in the NovaSeq data, which I agree is a 2-colour issue.
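For anyone who wants the same per-cycle view without fastqc, a small sketch follows. It assumes the read sequences have already been extracted into plain strings (for example from a telbam or FASTQ, which is not shown here). The function name and the toy reads are hypothetical.

```python
def g_fraction_per_cycle(reads):
    """Fraction of G calls at each cycle (read position) across reads.

    Reads may differ in length; each position is normalised by the
    number of reads that are long enough to cover it.
    """
    max_len = max((len(r) for r in reads), default=0)
    g_counts = [0] * max_len
    totals = [0] * max_len
    for r in reads:
        for i, base in enumerate(r.upper()):
            totals[i] += 1
            if base == "G":
                g_counts[i] += 1
    return [g / t for g, t in zip(g_counts, totals)]


# Toy reads, for illustration only
toy_reads = ["ACGG", "ACGT", "AGG"]
fractions = g_fraction_per_cycle(toy_reads)
print(fractions)
```

Running this separately on R1 and R2 reads, and on each platform, should show whether the G fraction climbs in later cycles.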

We also found that using the -e option on a per-sequencing-run basis helped normalise between samples. That is, group the samples run in the same batch, then use -e when generating the length estimates.

PedalheadPHX commented 3 years ago

Must admit I had not caught that increase in Gs. I just reviewed one of our most recent runs, and the increase in Gs is more notable in R2 than in R1, so it must reflect some deterioration of signal intensity on R2 after the turnaround.

R1: R1_Example

R2: R2_Example

kacyng92 commented 3 years ago

Thank you for your reply. I am experimenting with the -t option to see how much the effect of the excess Gs can be lessened.

kacyng92 commented 3 years ago

"For samples that were sequenced using the NovaSeq sequencing platform, the results using Telomerecat were occasionally implausible (such as telomere length estimates of 0 bp)."

Yes, I have both HiSeq and NovaSeq data.