telomere and nanopore chimeric sequences

christinawu2008 commented 4 years ago

Hi there,

We are wondering how Shasta handles reads that contain telomere sequence. One of our assemblies shows short telomere repeats (less than 1 kb long) in the middle of some scaffolds. Nanopore data also includes chimeric reads, some of which may carry telomere-like repeats in the original reads. These problems make it hard to produce a chromosome-level assembly. How does the assembler deal with them?

Thanks! Chen

paoloczi commented 4 years ago

Yes, in its current state Shasta is not able to assemble telomere sequence. I am hoping to devote some time later this year to algorithm improvements that permit assembly of centromeres and telomeres.

christinawu2008 commented 4 years ago

Thanks for your response! When you say "not able to assemble telomere sequence", does it mean Shasta can't produce a telomere signal towards the ends of scaffolds? We don't need the full telomere sequence, but we do want scaffolds showing telomere-like repeats at their ends, which would give us more confidence in the assembly. Also, do you reckon breaking or removing chimeric reads before running Shasta would be better?

Thanks! Chen

christinawu2008 commented 4 years ago

Another question: have you seen raw reads from human data that contain telomere sequence in the middle? Thanks!

paoloczi commented 4 years ago

> When you say "not able to assemble telomere sequence", does it mean Shasta can't produce a telomere signal towards the ends of scaffolds? We don't need the full telomere sequence, but we do want scaffolds showing telomere-like repeats at their ends, which would give us more confidence in the assembly.

It may be able to assemble a few kb of telomeric sequence at the beginning/end of each chromosome, but I never checked on that.
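One way to check this on an existing assembly is to scan each contig for tandem telomere repeats and note whether each run sits near a contig end or in the interior. The following is only a minimal sketch and not part of Shasta: the function names, the `TTAGGG` motif (vertebrate-type; other taxa use different repeats), the 4-copy minimum, and the 5 kb end window are all assumptions to adjust for your organism. Matching is exact, so runs interrupted by sequencing or assembly errors will be split or missed.

```python
import re

# Assumption: vertebrate-type telomere motif and its reverse complement.
MOTIF = "TTAGGG"
RC_MOTIF = "CCCTAA"


def telomere_runs(seq, min_copies=4):
    """Yield (start, end) for exact tandem runs of the motif on either strand."""
    pattern = re.compile(
        f"(?:{MOTIF}){{{min_copies},}}|(?:{RC_MOTIF}){{{min_copies},}}"
    )
    for match in pattern.finditer(seq.upper()):
        yield match.start(), match.end()


def classify_runs(seq, window=5000, min_copies=4):
    """Label each run 'end' if it touches the first/last `window` bp, else 'internal'."""
    n = len(seq)
    labelled = []
    for start, end in telomere_runs(seq, min_copies):
        where = "end" if start < window or end > n - window else "internal"
        labelled.append((start, end, where))
    return labelled


def read_fasta(path):
    """Minimal FASTA reader yielding (header, sequence) pairs."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

To use it, loop over `read_fasta("assembly.fasta")` (a placeholder path) and print `classify_runs(seq)` for each contig. Runs labelled `internal` are candidates for the mid-scaffold repeats described above and are worth checking against the supporting reads.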

> Do you reckon breaking or removing chimeric reads before running Shasta would be better?

I don't think so. Compared to other assemblers, Shasta is quite conservative and tends to leave sequence unassembled rather than make assembly errors. Shasta does include a chimeric read removal step, even though that is not covered in the documentation.

> Have you seen raw reads from human data that contain telomere sequence in the middle?

No, but I was not looking. Some of my colleagues may have better ideas about this. Let me see what I can find out.

If telomeric sequence were occasionally present in the middle of reads, and assuming this does not happen consistently at the same genome location for several reads, the assembly process would implicitly discard the reads in question.

christinawu2008 commented 4 years ago

Very well, thanks for your explanation, saving us lots of time! 👍

chanzuckerberg / shasta

telomere and nanopore chimeric sequences #124