Potential Centrominer improvements using proximity to telomere scoring, and gene annotation

colindaven commented 6 months ago

Hi,

Thanks for this toolset. I've only been interested in Centrominer, which is great for Arabidopsis but suffers greatly on other genomes, even bean, which is only about 500 Mb.

The problem I see is that centromeres are frequently predicted in telomeric regions, i.e. in the first or last megabase or so of sequence. This is obviously wrong and can be observed in over 50% of chromosomes. Often the predictions start literally at base 1. I have tried to mitigate this, without much success, by using 300 kbp rather than 100 kbp as the min_length.

Suggestion 1 - down-vote candidates near telomeres/chromosome ends?

Suggestion 2 - use gene counts per region or window

I noticed that centromeres are (obviously) not gene rich, and there is often a gradual reduction in gene content from the gene-rich regions of the chromosome towards the centromeres. I wonder if it might be possible to use a simple de novo gene finder like Augustus and integrate the number of genes per region into the centromere prediction.
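To make Suggestion 2 concrete, here is a minimal sketch of counting genes per fixed-size window from a GFF3 annotation (for example one produced by Augustus) and averaging that count over a candidate interval. The function names, the 100 kbp window, and the idea of using the mean as a down-voting signal are assumptions for illustration only; nothing here is part of Centrominer.

```python
from collections import Counter


def genes_per_window(gff3_lines, window=100_000):
    """Count 'gene' features per window, keyed by (seqid, window_index)."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # only count gene-level features
        start = int(cols[3])  # GFF3 coordinates are 1-based
        counts[(cols[0], (start - 1) // window)] += 1
    return counts


def gene_density(counts, seqid, start, end, window=100_000):
    """Mean genes per window over a 1-based, inclusive candidate interval."""
    first, last = (start - 1) // window, (end - 1) // window
    total = sum(counts[(seqid, w)] for w in range(first, last + 1))
    return total / (last - first + 1)
```

A candidate whose `gene_density` is well above the chromosome-wide minimum could then be ranked lower; how strongly to weight it would need tuning per genome.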

Thanks again

Echoring commented 6 months ago

Thank you for your advice. Before the release we wrote a development version that down-voted candidates by gene content, but its performance was not ideal. In fact, the results are called "candidates" because we had planned a "CentroVoter" module to vote among them. However, to meet a graduation requirement, we published the toolkit in an incomplete state, and since then we have been busy with other studies, so this work is still in progress. It will take a while before we can continue it. For now, we mostly ignore the "best candidate" and instead check several high-scoring regions directly, comparing them with gene content, Hi-C heatmaps, pairwise collinearity, etc. to find a reasonable result.

colindaven commented 6 months ago

I see, thanks for your comments, that's very understandable. It might be an option to simply present this information to users in the README.md and say that manual selection is highly recommended. Also, rather than only the best candidate, it might be better to present the top 5 or top 10 candidates and encourage the user to make the choice? And/or just eliminate candidates within, say, 1000 bp of the chromosome ends, or even only those overlapping coordinate 1.
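As a rough illustration of the filtering idea, the sketch below drops candidates that come within a margin of either chromosome end and then keeps the top N per chromosome. The `Candidate` record, the `top_candidates` function, and the assumption that a higher score is better are all hypothetical; Centrominer's actual candidate output may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class Candidate:  # hypothetical record, not Centrominer's real output format
    seqid: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: float


def top_candidates(cands, chrom_lengths, margin=1000, n=5):
    """Drop candidates within `margin` bp of a chromosome end, keep top n per chromosome."""
    kept = [
        c for c in cands
        if c.start > margin and c.end <= chrom_lengths[c.seqid] - margin
    ]
    by_chrom = {}
    # Assumes a higher score means a better candidate.
    for c in sorted(kept, key=lambda c: c.score, reverse=True):
        by_chrom.setdefault(c.seqid, []).append(c)
    return {seqid: cs[:n] for seqid, cs in by_chrom.items()}
```

Setting `margin=0` keeps every candidate, so the same function could also just produce the top 5 or top 10 list for manual inspection.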

Combining the TRF results in GFF3 format and viewing them in a genome browser like JBrowse is simple enough, but it is an undocumented approach.
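For anyone wanting to try this, here is a minimal sketch of the conversion, assuming the TRF results are available in TRF's standard `.dat` layout (a `Sequence:` header per sequence, then 15 whitespace-separated fields per repeat). Whether the toolkit keeps the results in that form is an assumption on my part, and the `trf_dat_to_gff3` name and the attribute keys are made up for illustration.

```python
def trf_dat_to_gff3(dat_lines):
    """Yield GFF3 lines from TRF .dat text (assumes the standard .dat layout)."""
    yield "##gff-version 3"
    seqid = None
    n = 0
    for line in dat_lines:
        line = line.strip()
        if line.startswith("Sequence:"):
            parts = line.split()
            seqid = parts[1] if len(parts) > 1 else None
            continue
        fields = line.split()
        # Repeat records start with integer start/end and have 15 fields.
        if seqid is None or len(fields) < 15:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        n += 1
        start, end, period, copies = fields[0], fields[1], fields[2], fields[3]
        score = fields[7]  # TRF alignment score
        attrs = f"ID=trf{n};period={period};copies={copies}"
        yield "\t".join(
            [seqid, "TRF", "tandem_repeat", start, end, score, ".", ".", attrs]
        )
```

Writing the yielded lines to a file gives a track that can be loaded next to a gene annotation for manual inspection of the candidates.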

aaranyue / quarTeT
