PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data

reference coordinates of pathogenic repeats #28

Open bw2 opened 6 months ago

bw2 commented 6 months ago

Comparing https://github.com/PacificBiosciences/trgt/blob/main/repeats/pathogenic_repeats.hg38.bed to https://github.com/broadinstitute/str-analysis/blob/main/str_analysis/variant_catalogs/variant_catalog_without_offtargets.GRCh38.json, I see some differences in the start and end coordinates.

Assuming TRGT input format is 0-based for the start coordinate, would it make sense to change the coordinates in pathogenic_repeats.hg38.bed as follows?

C11ORF80 (start -= 2)
AFF3 (start -= 1)
HOXD13 (end -= 1)
LRP12 (start -= 2)
MARCHF6 (end -= 2)
SAMD12 (start -= 3)
SOX3 (end -= 1)
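
For illustration, the adjustments listed above could be applied with a short script like the one below. This is only a sketch under assumptions: it assumes the BED file is tab-separated with the locus name stored as `ID=<name>` in the fourth column, and that the IDs match the gene names used above. The function name `adjust_bed_line` and the coordinates in the example line are made up for the demo.

```python
import re

# Offsets proposed in this issue: locus ID -> (start delta, end delta).
PROPOSED_OFFSETS = {
    "C11ORF80": (-2, 0),
    "AFF3": (-1, 0),
    "HOXD13": (0, -1),
    "LRP12": (-2, 0),
    "MARCHF6": (0, -2),
    "SAMD12": (-3, 0),
    "SOX3": (0, -1),
}


def adjust_bed_line(line, offsets=PROPOSED_OFFSETS):
    """Return the BED line with start/end shifted if its ID has an offset.

    Assumes tab-separated columns: chrom, start, end, info, where the
    info column contains an `ID=<name>` entry (an assumption, not verified
    against the actual file).
    """
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    # Leave comments and malformed lines untouched.
    if stripped.startswith("#") or len(fields) < 4:
        return stripped
    match = re.search(r"(?:^|;)ID=([^;]+)", fields[3])
    if match is None or match.group(1) not in offsets:
        return stripped
    d_start, d_end = offsets[match.group(1)]
    fields[1] = str(int(fields[1]) + d_start)
    fields[2] = str(int(fields[2]) + d_end)
    return "\t".join(fields)


# Made-up coordinates, only to show the effect of the AFF3 offset.
example = "chr2\t1000\t1030\tID=AFF3;MOTIFS=GCC;STRUC=(GCC)n"
print(adjust_bed_line(example))
```

Lines whose ID is not in the table pass through unchanged, so the script could be run over the whole file and diffed against the original to review each shift.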

Also, it might be worth adding these loci:

ABCD3
EIF4A3
FGF14 
PRDM12
PRNP 
RILPL1 
TBX1 
THAP11
VWA1 
ZFHX3
ZIC3

egor-dolzhenko commented 6 months ago

Thank you for bringing this up! We will add the loci you mentioned and start working on evaluating / improving reference coordinates of known pathogenic repeats.

hdashnow commented 6 months ago

I've been using the STRchive loci for this. We regenerate this file automatically whenever the database is updated: https://github.com/dashnowlab/STRchive/blob/main/data/hg38.STRchive-disease-loci.TRGT.bed