ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

MAF sequence names #1472

Open austinchipps opened 2 months ago

austinchipps commented 2 months ago

Hi y'all!

I used Progressive Cactus to align the whole genomes of 8 species. Then, using cactus-hal2maf with the --bedRanges option, I was able to output an alignment that includes only my genes of interest. When I inspect the MAF file, however, the only names attached to sequences are the species.chromosome i.e., Mus_musculus.NC_000072.7. This has become a problem in downstream applications: some of the genes are on the same chromosome, so their sequences have the same name, and even tools like AlignIO from Biopython can't read the MAF file. Is there any way around this issue using the tools provided by Cactus?

Thank you in advance for your help!

glennhickey commented 2 months ago

> the only names attached to sequences are the species.chromosome i.e., Mus_musculus.NC_000072.7

What more do you want?

austinchipps commented 2 months ago

Hey Glenn,

Apologies if my original question wasn't clear. My MAF file contains multiple aligned regions (genes, in this case) from the same chromosome, so the alignments of 'gene1' and 'gene2' carry the same sequence name whenever they come from the same chromosome. In my example above, that name is "Mus_musculus.NC_000072.7". Since I have a BED file of target genes that includes their names in addition to their coordinates, I was wondering if y'all had a tool or a recommended way to distinguish alignments of different genes from the same chromosome. My goal is to eventually split the alignment (after converting to FASTA) so that I end up with one alignment file per gene. I haven't been able to do this yet because the sequence names repeat across genes.

Austin
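
For anyone landing here with the same goal, one possible workaround outside of Cactus is to group the MAF blocks by the BED interval they overlap on the reference genome. The sketch below is not a Cactus tool and makes several assumptions: the function names (`read_bed`, `maf_blocks`, `ref_interval`, `split_maf_by_gene`) are hypothetical, the BED file has the gene name in column 4, the BED chromosome column holds the bare contig name (e.g. NC_000072.7), and the reference rows in the MAF are named `<ref_genome>.<contig>`. It uses only the Python standard library.

```python
from collections import defaultdict


def read_bed(lines):
    """Parse BED lines into {chrom: [(start, end, name), ...]} (0-based, half-open)."""
    genes = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or line.startswith(("#", "track", "browser")):
            continue  # skip headers and rows without a name column
        chrom, start, end, name = fields[:4]
        genes[chrom].append((int(start), int(end), name))
    return genes


def maf_blocks(lines):
    """Yield each alignment block as a list of lines, beginning with its 'a' line."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():          # a blank line ends the current block
            if block:
                yield block
            block = []
        elif line.startswith("a"):    # an 'a' line starts a new block
            if block:
                yield block
            block = [line]
        elif block:                   # s/i/e/q lines belong to the open block
            block.append(line)
    if block:
        yield block


def ref_interval(block, ref_genome):
    """Return (contig, start, end) of the reference row on the forward strand."""
    prefix = ref_genome + "."
    for line in block:
        fields = line.split()
        if fields[0] == "s" and fields[1].startswith(prefix):
            start, size = int(fields[2]), int(fields[3])
            strand, src_size = fields[4], int(fields[5])
            if strand == "-":
                # MAF '-' starts are relative to the reverse-complemented source
                start = src_size - start - size
            return fields[1][len(prefix):], start, start + size
    return None


def split_maf_by_gene(maf_lines, bed_lines, ref_genome):
    """Group MAF blocks by the name of every BED interval they overlap."""
    genes = read_bed(bed_lines)
    per_gene = defaultdict(list)
    for block in maf_blocks(maf_lines):
        hit = ref_interval(block, ref_genome)
        if hit is None:
            continue  # block has no row for the reference genome
        contig, start, end = hit
        for g_start, g_end, name in genes.get(contig, []):
            if start < g_end and end > g_start:
                per_gene[name].append(block)
    return per_gene
```

Calling `split_maf_by_gene(open("genes.maf"), open("genes.bed"), "Mus_musculus")` (the file names are placeholders) returns a dictionary keyed by gene name. Each gene's blocks can then be written to its own file after a `##maf version=1` header line and converted to FASTA from there. A block that overlaps two BED intervals is assigned to both genes, and blocks are not trimmed to the gene boundaries.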