ChaissonLab / vamos

VNTR annotation using motif selection
GNU General Public License v2.0

edit distance large #15

Open Hanjunmin opened 1 week ago

Hanjunmin commented 1 week ago

Hi, thank you very much for developing this tool! I found that in some regions the edit distance between motifs is very large. For example: chr1 7105232 7105273 GAGCTGG,GAGCCAA,GGGCTGG,GAGCTGC,GAGCTCT,GAGCTG. The edit distance between "GAGCTGG" and "GAGCCAA" is 3, but the motif length is only 7, which I think is a big divergence. Did this happen because you kept iteratively adding motifs, or was this level of deviation allowed from the beginning?
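For reference, the pairwise distances among the motifs listed above can be checked with a short script. This is a minimal sketch in plain Python using a standard Levenshtein distance (unit cost for substitutions, insertions, and deletions); the function name `edit_distance` is illustrative and is not part of vamos, and vamos itself may score motif similarity differently.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions, deletions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


# Motif list copied from the chr1:7105232-7105273 catalog entry quoted above.
motifs = "GAGCTGG,GAGCCAA,GGGCTGG,GAGCTGC,GAGCTCT,GAGCTG".split(",")

# Distance for every unordered pair of motifs.
pairwise = {(a, b): edit_distance(a, b) for a, b in combinations(motifs, 2)}

for (a, b), d in pairwise.items():
    print(a, b, d)
```

The pair ("GAGCTGG", "GAGCCAA") comes out at 3, matching the value quoted in the report: the two motifs share the prefix GAGC, and the remaining three bases all differ.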

Hanjunmin commented 1 week ago

Perhaps these additional motifs with larger edit distances come from other samples, and you only keep the edit distances among motifs from the reference genome within a certain range?

mchaisso commented 6 days ago

Hi Hanjunmin, could you clarify what you are doing and what you expect? vamos very specifically builds motif databases of diverse motifs.

Best, -Mark

BidaGu commented 5 days ago

To supplement Mark's comments: this deviation is allowed from the beginning. We first generate tandem repeat calls for a panel of input genomes (specifically the 94 HPRC genomes) using Tandem Repeats Finder (TRF) and RepeatMasker with their default parameters. Our tandem repeat boundaries and motif catalogs are then refined based on these multi-genome calls.

We do not filter raw tandem repeat calls from TRF/RepeatMasker based on sequence purity. Instead, any tandem repeat region supported by a sufficient number of genomes (in this case, at least three) is considered a valid tandem repeat region. More details about the catalog-building pipeline can be found in our recent manuscript: https://www.biorxiv.org/content/10.1101/2024.08.07.607105v1.
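The support rule described above (keep a region when at least three genomes call it, with no purity filter) can be illustrated with a toy sketch. This is not the actual catalog-building pipeline: the function `supported_regions`, the genome labels, and the second region are hypothetical, and the sketch assumes calls have already been mapped to shared coordinates and match exactly, whereas the real pipeline refines boundaries across genomes.

```python
from collections import defaultdict

MIN_SUPPORT = 3  # support threshold mentioned in the thread


def supported_regions(calls_by_genome, min_support=MIN_SUPPORT):
    """Return regions called as tandem repeats in at least `min_support` genomes.

    calls_by_genome maps a genome label to an iterable of (chrom, start, end)
    tuples. Simplifying assumption: identical tuples mean the same region.
    """
    support = defaultdict(set)
    for genome, regions in calls_by_genome.items():
        for region in regions:
            support[region].add(genome)  # a set counts each genome once
    return sorted(r for r, genomes in support.items() if len(genomes) >= min_support)


# Toy input: hypothetical genome labels; the second region is made up.
calls = {
    "genome_A": [("chr1", 7105232, 7105273), ("chr2", 100, 150)],
    "genome_B": [("chr1", 7105232, 7105273)],
    "genome_C": [("chr1", 7105232, 7105273), ("chr2", 100, 150)],
}

print(supported_regions(calls))
```

In this toy input only the chr1 region is kept, because the made-up chr2 region is supported by two genomes. Note that nothing in the rule looks at motif divergence, which is why low-purity regions can remain in the catalog.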

For the chr1 example, this is a fairly conserved low-purity repeat region across all genomes (and is also present in the T2T reference). Every genome marks this region as a tandem repeat via TRF/RepeatMasker, so we’ve kept it in our catalog, even though its motifs are somewhat divergent.