ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format

masking sites in target genome after alignment #303

Open JeffWeinell opened 5 months ago

JeffWeinell commented 5 months ago

I have an alignment of 58 snake genomes stored as a HAL file and generated using Progressive Cactus. For each genome in the alignment, I have a BED file specifying site positions in the ungapped genome that I want to be hard-masked (with Ns) in an updated alignment.

The example below illustrates what I am trying to do.

Input files that I have:

(1) An alignment (portrayed here as an alignment block with dummy data for simplicity).

genome1.seqABC  CATAATT----CACCACTCGCACCAGGACGAAAAACGTATTCTTgctgacgcgtttcttatt
genome2.seqXYZ  cataattcaTCCACCACTCGCAccagGACGAAAAACGT------gctgacgcgtttcttatt

(2) BED file (dummy data) with regions of ungapped genome2 that I want to be hard-masked in the updated alignment.

seqXYZ  0   9
seqXYZ  22  26

Desired updated alignment

After hard-masking the target genome sites in the BED file, the updated alignment includes unmasked, soft-masked, and hard-masked sites:

genome1.seqABC  CATAATT----CACCACTCGCACCAGGACGAAAAACGTATTCTTgctgacgcgtttcttatt
genome2.seqXYZ  NNNNNNNNNTCCACCACTCGCANNNNGACGAAAAACGT------gctgacgcgtttcttatt

I would greatly appreciate any help with how to solve this problem!

-Jeff

glennhickey commented 5 months ago

I don't think HAL has any tools that allow you to modify the sequences. Your best bet is probably to export to MAF, then do the masking with your own script. The taffy Python API can parse MAF files and may be helpful for this.
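As a rough illustration of what such a script has to do at its core, here is a minimal Python sketch that masks a single alignment row by ungapped coordinates. It is not part of HAL or taffy; the function name `hard_mask_row` and its parameters are hypothetical, and it assumes BED-style 0-based half-open intervals on the same strand as the row.

```python
def hard_mask_row(aligned, intervals, start=0):
    """Replace bases whose ungapped coordinate falls in an interval with N.

    aligned   -- one alignment row, with '-' marking gaps
    intervals -- iterable of (begin, end) pairs, 0-based half-open as in BED
    start     -- ungapped coordinate of the first base in this row
    (All names here are hypothetical, chosen for illustration only.)
    """
    intervals = sorted(intervals)
    out = []
    pos = start  # ungapped coordinate of the next non-gap character
    i = 0        # index of the first interval that may still overlap pos
    for ch in aligned:
        if ch == "-":
            out.append(ch)  # gaps consume no sequence coordinate
            continue
        # Skip intervals that end at or before the current position.
        while i < len(intervals) and intervals[i][1] <= pos:
            i += 1
        if i < len(intervals) and intervals[i][0] <= pos:
            out.append("N")
        else:
            out.append(ch)
        pos += 1
    return "".join(out)


# Dummy genome2 row and BED intervals from the original post.
row = "cataattcaTCCACCACTCGCAccagGACGAAAAACGT------gctgacgcgtttcttatt"
print(hard_mask_row(row, [(0, 9), (22, 26)]))
```

On the dummy data this should reproduce the desired genome2 row from the original post. The key point is that only non-gap characters advance the coordinate counter, so the BED positions stay in ungapped space while the gap columns pass through untouched.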

JeffWeinell commented 5 months ago

Thanks!

I also have the alignment in a MAF file (converted using cactus-hal2maf), but I ran into the same problem (no obvious tool for the job) as when starting with the HAL file. The programs taffy, maf_parse (implemented in PHAST), and MafFilter seemed promising, but as far as I can tell they won't do what I need either.
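For anyone attempting this directly on the MAF text, the following is a hedged sketch of the per-line step in plain Python, with no dependency on taffy, PHAST, or MafFilter. The function name `mask_maf_s_line` and the `bed` dictionary are hypothetical. It assumes the standard MAF `s` line layout (src, start, size, strand, srcSize, text), that BED intervals are keyed by the full MAF source name (for example `genome2.seqXYZ`), and that BED coordinates refer to the forward strand.

```python
def mask_maf_s_line(line, bed):
    """Hard-mask one MAF 's' line; any other line is returned unchanged.

    bed maps a MAF source name to a list of (begin, end) forward-strand
    intervals, 0-based half-open. (Hypothetical helper, illustration only.)
    """
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        return line
    _, src, start, _size, strand, src_size, text = fields
    intervals = bed.get(src)
    if not intervals:
        return line
    start, src_size = int(start), int(src_size)
    out = []
    offset = 0  # number of non-gap bases seen so far in this row
    for ch in text:
        if ch == "-":
            out.append(ch)
            continue
        if strand == "+":
            pos = start + offset
        else:
            # On '-' rows, start is relative to the reverse-complemented
            # source, so convert back to a forward-strand coordinate.
            pos = src_size - 1 - (start + offset)
        # Linear scan over intervals: simple, but slow for large BED files.
        masked = any(b <= pos < e for b, e in intervals)
        out.append("N" if masked else ch)
        offset += 1
    # Swap in the masked text while preserving the original column spacing.
    body = line.rstrip()
    return body[: len(body) - len(text)] + "".join(out) + line[len(body):]


bed = {"g.chr": [(3, 5)]}  # made-up source name and interval
print(mask_maf_s_line("s g.chr 2 4 + 10 AC-GT", bed))
print(mask_maf_s_line("s g.chr 2 4 - 10 AC-GT", bed))
```

Applying this to every line of the file while streaming it would leave block headers and other line types intact. For genome-scale BED files the linear interval scan would likely need to be replaced by a sorted-interval or bisect lookup.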

I can't be the only person that has needed to do this. If I come across a solution elsewhere, I'll share it here.

Thanks again, -Jeff