aaranyue / quarTeT

A telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification
http://atcgn.com:8080/quarTeT/home.html

Advice for "-l" parameter setting #35

Open V-JJ opened 2 months ago

V-JJ commented 2 months ago

Hello!

We're trying to run quartet on the "Rhynchospora pubera" genome. This is a holocentromeric species with hundreds of centromeres spread across the genome, each delimited by Tyba tandem repeats of about 172 bp. Each Tyba array is about 20 kb long, and the average spacing between two consecutive arrays is about 360 kb. Here is the paper reference.

Our idea was to run quartet while varying the "-l" parameter, but the number of predicted centromeric regions stays roughly the same for every value, at about 60 per chromosome. The expected number is about 600 for most of the chromosomes.

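| Chr        | 10k | 20k | 30k | 50k | 100k | 200k |
|------------|-----|-----|-----|-----|------|------|
| CM051459.1 | 63  | 62  | 61  | 61  | 60   | 59   |
| CM051460.1 | 38  | 38  | 38  | 38  | 38   | 37   |
| CM051461.1 | 43  | 43  | 43  | 43  | 42   | 42   |
| CM051462.1 | 53  | 53  | 53  | 53  | 53   | 52   |
| CM051463.1 | 39  | 39  | 39  | 38  | 38   | 38   |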

I've counted the number of centromeric regions by counting the lines that start with the name of a given chromosome in the *candidate files.

For example, I count 2 centromeric regions in the example file below. I hope it is correct to do it that way.

What do you think? Can we adjust any other quartet parameter? Any advice or suggestion would be helpful, thanks


# Chr   start   end     length  TRlength        TRcoverage
#       subTR   period  subTRlength     subTRcoverage   pattern
CM051463.1      104670288       105342098       671811  75942   11.3%
        CM051463.1@TR_01128     198     12913   1.92%   CAAAGTGAAATAATGCACAAAA...
        CM051463.1@TR_00137     174     10963   1.63%   ATGATTCATATCATAAAAAAAA...
        CM051463.1@TR_07164     159     9922    1.48%   AATATGATTCATATGAAAAAAA...
CM051463.1      104670288       105342098       671811  75942   11.3%
        CM051463.1@TR_04812     179     9600    1.43%   TTCTAAGTCATTTTATCACAAT...
        CM051463.1@TR_03582     192     9580    1.43%   ATTCTAGACAGAATAAAGAGTT...
        CM051463.1@TR_01323     172     8240    1.23%   ATGATGCTCAGAATTGCATTAT...
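The counting approach described above can be sketched roughly as follows. This is only an illustration: the function name `count_candidates` is hypothetical, the embedded text is an abridged copy of the example file, and it assumes that sub-TR lines are always indented while region lines start at column 0. Besides the raw line count, it also reports the number of distinct (start, end) intervals, since the two region lines in the example share the same coordinates.

```python
from collections import defaultdict


def count_candidates(lines):
    """Return {chrom: (region_lines, distinct_intervals)} for a candidate file.

    Assumes region lines start at column 0 and sub-TR lines are indented.
    """
    line_counts = defaultdict(int)
    intervals = defaultdict(set)
    for line in lines:
        # Skip blank lines, "#" header lines, and indented sub-TR lines.
        if not line.strip() or line.startswith("#") or line[0].isspace():
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        line_counts[chrom] += 1
        intervals[chrom].add((start, end))
    return {c: (line_counts[c], len(intervals[c])) for c in line_counts}


# Abridged copy of the example file above (one sub-TR line kept per region).
example = """\
# Chr   start   end     length  TRlength        TRcoverage
#       subTR   period  subTRlength     subTRcoverage   pattern
CM051463.1      104670288       105342098       671811  75942   11.3%
        CM051463.1@TR_01128     198     12913   1.92%   CAAAGTGAAATAATGCACAAAA...
CM051463.1      104670288       105342098       671811  75942   11.3%
        CM051463.1@TR_04812     179     9600    1.43%   TTCTAAGTCATTTTATCACAAT...
"""

print(count_candidates(example.splitlines()))
# {'CM051463.1': (2, 1)}
```

For this example the line count is 2 but only 1 distinct interval is present, so counting lines may overestimate the number of separate regions if the same interval is listed more than once.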

Echoring commented 2 months ago

Sorry, I'm on a business trip and can't look at this carefully right now. The -l parameter defines the minimum size of a tandem repeat region for it to be selected as a candidate. If you want more candidates, you should lower this parameter and also reduce the -g parameter, so that regions which might otherwise be predicted as a single region get split. I'll take a closer look once I've finished my business trip.
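To make the interaction between the two thresholds concrete, here is a toy model. It is not quarTeT's actual implementation: it simply assumes that nearby repeat arrays are merged when the gap between them is at most some maximum gap (standing in for -g) and that merged regions shorter than a minimum length (standing in for -l) are dropped. The function name `merge_and_filter` and the synthetic coordinates are hypothetical; only the 20 kb array size and 360 kb spacing come from the question.

```python
def merge_and_filter(arrays, max_gap, min_len):
    """Merge intervals separated by <= max_gap, then keep those >= min_len."""
    merged = []
    for start, end in sorted(arrays):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_len]


# Ten synthetic 20 kb arrays, each separated from the next by 360 kb.
arrays = [(i * 380_000, i * 380_000 + 20_000) for i in range(10)]

for max_gap in (500_000, 100_000):
    for min_len in (10_000, 100_000):
        n = len(merge_and_filter(arrays, max_gap, min_len))
        print(f"max_gap={max_gap:>7} min_len={min_len:>7} -> {n} regions")
```

In this toy setting, a maximum gap larger than the array spacing collapses all ten arrays into one region regardless of the minimum length, while a smaller maximum gap keeps them separate, so they survive only if the minimum length is at most the array size. Whether the same effect explains the counts reported above depends on how quarTeT actually applies -g and -l.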

V-JJ commented 2 months ago

Hi!

Thanks for the fast response! We'll have a look at the "-g" parameter.

All best,