How does Cactus treat repeat/soft-masked regions

brettChapman commented 4 years ago

Hi

In the Cactus paper (https://www.biorxiv.org/content/10.1101/730531v3) it mentions that it expects the input genomes to be soft-masked. I was wondering how Cactus treats the soft-masked regions.

I've consulted with the authors of minimap2 and VG, and both recommend not masking the genomes for whole-genome alignment. My intention is to perform a 20-genome alignment with Cactus (see guide tree attached), then pull the alignments from Cactus into VG so I can align genomic reads from non-assembled varieties to the pangenome, and also for visualisation and exploratory analysis using tools such as sequenceTubeMap (https://academic.oup.com/bioinformatics/article/35/24/5318/5542397) and MoMIG (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3145-2). Since others recommend no repeat masking, what do the authors of Cactus recommend? Does Cactus still align through the soft-masked regions, but treat those alignments differently (perhaps considering only unique alignments)? My genomes are roughly 70-80% repeat-masked, based on my own findings with RepeatMasker and on other papers.

Thank you for clarification on this.

[image: pangenome_tree_bootstrap]

glennhickey commented 4 years ago

Softmasking is important for Cactus. Cactus attempts its own masking in a preprocessing step, but starting from input that has already been softmasked with RepeatMasker helps considerably.

Cactus ignores softmasked sequences when performing the pairwise and self alignments that are used to construct the initial graph. Without this logic, the graph is too collapsed and nothing works properly downstream.

After the initial graph is constructed, regions are processed locally using a more sensitive algorithm. Softmasked sequences will get considered then. So softmasked sequences can be aligned, but they would generally need to be anchored by unmasked alignments to do so.

And yeah, softmasking will probably only serve to confuse vg. If you're using hal2vg to convert, the default output format (PackedGraph) will strip the softmasking automatically.
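Since Cactus relies on softmasking (lowercase bases) to decide what to skip in the initial alignments, it can be useful to check how much of each input FASTA is actually masked before running it. The following is a minimal sketch, not part of Cactus: the function name `softmasked_fraction` is hypothetical, and it assumes a plain-text FASTA in which lowercase letters mark masked bases.

```shell
# Print the fraction of soft-masked (lowercase) bases in a FASTA file.
# Assumption: masking is encoded as lowercase sequence characters.
# Usage: softmasked_fraction genome.fa
softmasked_fraction() {
    # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
    LC_ALL=C awk '
        /^>/ { next }                      # skip FASTA header lines
        {
            total += length($0)            # count all bases on this line
            masked += gsub(/[a-z]/, "")    # gsub returns the number of lowercase bases
        }
        END {
            if (total == 0) { print "0.0000"; exit }
            printf "%.4f\n", masked / total
        }
    ' "$1"
}
```

For a genome that is about 75% masked, `softmasked_fraction genome.fa` would be expected to print a value close to 0.75.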

brettChapman commented 4 years ago

Thanks for the clarification.

I'll soft-mask only if using Cactus in future.

I've converted some HAL files before, on a smaller scale than my 20-genome study, and found that hal2vg would use up all my RAM and the job would be killed. That was with just a 3-genome example on a single chromosome (I only have 64GB RAM on a single node).

I found the following got around the problem using Seqwish:

    hal2fasta --hdf5InMemory --subtree --upper output.hal Anc0 > pangenome.fa
    hal2paf --hdf5InMemory output.hal > pangenome.paf
    seqwish -t 16 -p pangenome.paf -s pangenome.fa -g pangenome.gfa
    vg convert -g pangenome.gfa -p > pangenome.pg

Would this method also strip the soft-masking from the graph, or would only hal2vg do this?

glennhickey commented 4 years ago

hal2fasta --upper in that pipeline does convert everything to upper case.
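To confirm that a FASTA written this way really carries no softmasking before it goes into seqwish or vg, a quick check along the following lines may help. This is only a sketch with hypothetical helper names (`has_softmasking`, `strip_softmasking`); it does not call hal2fasta, and it assumes masking is encoded purely as lowercase sequence characters.

```shell
# Uppercase the sequence lines only; FASTA headers (">...") pass through unchanged.
# Usage: strip_softmasking in.fa > out.fa
strip_softmasking() {
    LC_ALL=C awk '/^>/ { print; next } { print toupper($0) }' "$1"
}

# Exit status 0 if any sequence line contains a lowercase letter, 1 otherwise.
# Usage: has_softmasking pangenome.fa && echo "still soft-masked"
has_softmasking() {
    LC_ALL=C awk '
        !/^>/ && /[a-z]/ { found = 1; exit }   # stop at the first masked base
        END { exit !found }                     # status 0 only when one was found
    ' "$1"
}
```

Headers are deliberately excluded from the check, so lowercase characters in sequence names do not count as masking.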

hal2vg has used considerably less memory since about July 2020, so it may be worth another try if you haven't used it since then (a standalone binary release is now available). That said, seqwish should still scale better in terms of memory.

The hal/paf/seqwish pipeline will leave some SNPs uncollapsed. You can merge them afterwards using either vg mod -n or smoothxg, but I don't think either of these methods is extremely robust at the moment. @ekg is actively developing smoothxg, so you can probably rely on it soon, if not already.

ComparativeGenomicsToolkit / cactus

How does Cactus treat repeat/soft-masked regions #295