How to get the 'pangenome' of all mammals?

ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

How to get the 'pangenome' of all mammals? #1150

Open liu-wq opened 1 year ago

liu-wq commented 1 year ago

Hi,

I ran a Cactus alignment of 30 mammalian genomes. I want to obtain a pangenome-like sequence that contains the insertions from all species, and then convert each species' genome coordinates into coordinates on that sequence. How can I achieve this?

For example, I want to convert the coordinates of each mammal's methylation sites into coordinates on one common sequence. But if I specify a single species as the reference, sequences that are absent from that species are lost.

Best, Weiqiang

glennhickey commented 1 year ago

I'm assuming you're aligning different species with Progressive Cactus? And you want some kind of "consensus" sequence for all inputs?

I don't think we have a way to do this. If you're willing to regenerate the alignment, you might be able to turn down "minimumBlockDegree" to "1" (default value is "2") in "cactus_progressive_config.xml". This would make sure each ancestor contains all sequences below it, which should give you what you want at Anc0. But it may bias your alignment (or not work as I expect).

It's an interesting question, and there's probably a way to come up with something using HAL at a fairly low level, but I don't really have anything to recommend now.
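The "minimumBlockDegree" change suggested above could be scripted roughly as follows. This is only a sketch: it assumes the setting appears as an XML attribute named minimumBlockDegree somewhere in cactus_progressive_config.xml, and it simply rewrites every occurrence it finds. The function name and the example fragment are hypothetical and do not come from the Cactus codebase.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def set_minimum_block_degree(xml_text: str, value: int = 1) -> Tuple[str, int]:
    """Set every minimumBlockDegree attribute to `value`.

    Returns the edited XML text and the number of attributes changed.
    Assumption: the setting is stored as an XML attribute with this name.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    # root.iter() visits the root element and all of its descendants.
    for elem in root.iter():
        if "minimumBlockDegree" in elem.attrib:
            elem.set("minimumBlockDegree", str(value))
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical fragment for illustration; the real config has many more
# elements and attributes.
example = '<cactusWorkflowConfig><caf minimumBlockDegree="2"/></cactusWorkflowConfig>'
new_xml, n_changed = set_minimum_block_degree(example)
print(n_changed, new_xml)
```

Note that ElementTree drops XML comments when parsing this way, so editing the value by hand in a copy of the config file may be preferable if the comments matter. If the function reports zero changes, the attribute is not where this sketch assumes it is.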

liu-wq commented 1 year ago

Thank you for your reply!

It may take another month to regenerate the alignment :(

I used the command hal2maf --global --noDupes --keepEmptyRefBlocks to convert the HAL file to a MAF file.

Is it possible to get a MAF file without specifying a reference genome, so that the sequence of each species is complete?

If so, I could probably generate a consensus sequence for each block with a Perl script?

Best, Weiqiang
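The per-block consensus idea above might look something like the sketch below, written in Python rather than Perl. It assumes standard MAF "s" lines (src, start, size, strand, srcSize, text) and takes a simple majority vote over non-gap bases in each column. The function names are hypothetical, and a majority vote is only one possible definition of "consensus". It does not address whether hal2maf can produce a reference-free MAF, and it does not build the coordinate mapping from each species onto the consensus.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Block = List[Tuple[str, str]]  # (source name, aligned text) per row


def read_maf_blocks(lines: Iterable[str]) -> Iterator[Block]:
    """Yield each MAF alignment block as a list of (src, aligned_text)."""
    block: Block = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("s "):
            # s <src> <start> <size> <strand> <srcSize> <text>
            fields = line.split()
            block.append((fields[1], fields[6]))
        elif (not line or line == "a" or line.startswith("a ")) and block:
            # A blank line or a new "a" line closes the current block.
            yield block
            block = []
    if block:
        yield block


def block_consensus(block: Block) -> str:
    """Majority-vote consensus over non-gap characters in each column.

    Ties go to the base seen first in row order; columns that contain
    only gaps are emitted as '-'.
    """
    out = []
    for column in zip(*(text.upper() for _, text in block)):
        counts = Counter(base for base in column if base != "-")
        out.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(out)


# Tiny made-up example, not real alignment output.
maf_text = """##maf version=1
a
s A.chr1 0 4 + 10 ACGT
s B.chr1 0 3 + 10 AC-T
s C.chr1 0 4 + 10 ATGT

a
s A.chr1 4 1 + 10 G-
s B.chr1 3 2 + 10 GA
"""
for blk in read_maf_blocks(maf_text.splitlines()):
    print(block_consensus(blk))
```

Because gaps are ignored in the vote, an insertion present in a single species is kept in the consensus, which matches the goal of retaining insertions from all species. To convert coordinates afterwards, one would also need to record, for each row, the start field and the column offsets of its non-gap bases.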