ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Segmental duplications #1458

Open jonahcullen opened 3 months ago

jonahcullen commented 3 months ago

Hello, I am trying to replicate the generation of the Year1 centromeric satellite and segmental duplication annotations, as described for use with contig.inclusion.stats.R. If I understand correctly, the centromeric annotations are produced with dna-brnn as part of cactus-preprocess (--maskMode brnn). Which mask action was chosen? I am probably just missing it due to my unfamiliarity, but where and how are the segmental duplications in the sedef.bedpe files marked? Is that done with sedef, or now with biser? And what, if anything, was done after sedef/biser to generate, for example, HG00438.maternal.sedef.bedpe?

Thanks for your time, Jonah.

glennhickey commented 3 months ago

dna-brnn was run with its default settings. From the HPRC paper (https://www.nature.com/articles/s41586-023-05896-x#Sec120):

SD annotation

SDs were annotated using sedef (ref. 85) after masking repeats in each assembly. Repeats annotated with more than 20 copies corresponded to unannotated mobile elements and were excluded from the analysis. The pipeline for annotating SDs is available at GitHub (https://github.com/ChaissonLab/SegDupAnnotation/releases/tag/vHPRC).
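To make the "more than 20 copies" exclusion in the quoted paragraph concrete, here is a minimal Python sketch. It is not the SegDupAnnotation pipeline. It assumes a tab-separated BEDPE file whose first six columns are chrom1, start1, end1, chrom2, start2, end2, and it assumes that the "copy count" of a repeat can be approximated by the number of pairs sharing the same first interval. The function names `read_bedpe` and `filter_high_copy` are hypothetical, and the real pipeline may define copies differently.

```python
import io
import csv
from collections import Counter

# Threshold quoted from the paper: repeats with more than 20 copies are excluded.
MAX_COPIES = 20


def read_bedpe(handle):
    """Read tab-separated BEDPE rows, skipping blank lines and '#' comments."""
    reader = csv.reader(handle, delimiter="\t")
    return [row for row in reader if row and not row[0].startswith("#")]


def filter_high_copy(rows, max_copies=MAX_COPIES):
    """Keep pairs whose first interval occurs in at most max_copies pairs.

    ASSUMPTION: copy number is approximated by how many pairs share the
    same (chrom1, start1, end1) interval. This is for illustration only.
    """
    rows = list(rows)
    copies = Counter((r[0], r[1], r[2]) for r in rows)
    return [r for r in rows if copies[(r[0], r[1], r[2])] <= max_copies]


if __name__ == "__main__":
    # Tiny in-memory example; a real run would open a *.sedef.bedpe file.
    demo = io.StringIO("chr1\t100\t200\tchr2\t300\t400\n")
    for row in filter_high_copy(read_bedpe(demo)):
        print("\t".join(row))
```

To try it on a real file, replace the in-memory `demo` handle with `open(path)`, after checking the actual column layout of the file.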

I think the segdup data may live here:

https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/0175F9C0-83B5-4CA3-9256-EC0593490EE7--repeats-and-segdups/