ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Compute resource requirements #217

Open brettChapman opened 4 years ago

brettChapman commented 4 years ago

Hi

I'm interested in using Cactus to perform a multiple sequence alignment of 20 barley genomes, each around 4.3-5.3 Gbp. I plan to request access to a supercomputing facility here, but I imagine the compute requirements for running 20 genomes will be well beyond what I can feasibly request. My plan is to run Cactus, use vg to create a variation graph, and then explore different regions of the sequences with sequenceTubeMap. Is there a way I could get around the compute requirements, such as splitting the genomes into smaller batches while keeping the main reference-quality genome in each batch, or should I avoid Cactus altogether and simply use a reference-based alignment tool such as ClustalO?

Until a more memory-efficient version of Cactus becomes available, do you have any suggestions to help limit the resource overhead for such a large study?

Thank you.

Regards

Brett

diekhans commented 4 years ago

Hi Brett,

That is an interesting data set. If you can break your genomes up into an evenly proportioned tree, we have ways to decompose the problem.

How good are the non-reference genomes? If they are not reasonably high quality, a reference-based alignment might be the place to start.


brettChapman commented 4 years ago

Hi Mark

By an evenly proportioned tree, do you mean a guide tree (Newick format)?

All the genomes are in pseudomolecules, arranged and organised based on the reference Morex assembly published a few years ago (https://www.nature.com/articles/nature22043; Morex has since been updated to v2 of the assembly), so they're all pretty good references: no scaffolds or contigs, all organised by chromosome. In addition we have two wild-type varieties that are only at scaffold level, and we may align those too to provide a complete picture.

glennhickey commented 4 years ago

Hi Brett,

You definitely need a phylogeny for this data, and it should be fully resolved (i.e. binary). Once you have this, you can make an input file as described in the README:

https://github.com/ComparativeGenomicsToolkit/cactus#seqfile-the-input-file

Of particular importance is the second paragraph, beginning with "An optional * can be placed at the beginning of a name to specify that its assembly is of reference quality." You can flag your reference assembly in this way.
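For concreteness, a seqFile for a data set like yours might look something like the sketch below (cultivar names, branch lengths and paths are just placeholders). The first line is the Newick guide tree, which must be fully resolved, and the leading * marks Morex as reference quality:

```
(((Morex:0.02,CultivarA:0.02):0.01,CultivarB:0.03):0.01,WildBarley1:0.05);
*Morex       /path/to/morex_v2.fasta
CultivarA    /path/to/cultivarA.fasta
CultivarB    /path/to/cultivarB.fasta
WildBarley1  /path/to/wildbarley1.fasta
```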

You can then try cactus-prepare to work on subtrees independently as described here:

https://github.com/ComparativeGenomicsToolkit/cactus#running-step-by-step-experimental

Because you flagged your reference, it should end up being used as an outgroup even in subtrees that don't contain it (which I think is what you are looking for). By running one of these subtrees, you will hopefully get a sense of the overall compute requirements, as well as the type of output to expect, when running the whole thing.
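A minimal cactus-prepare invocation might look something like the sketch below; the output paths are placeholders and the flags are the ones described in the step-by-step section linked above, so double-check them against the version you install:

```
# Placeholder paths; flags per the step-by-step README section (verify
# against your installed Cactus version).
cactus-prepare barley_seqfile.txt \
    --outDir steps-output \
    --outSeqFile steps-output/barley_seqfile.txt \
    --outHal steps-output/barley.hal \
    --jobStore jobstore
```

cactus-prepare prints the individual commands to run, one subtree at a time, so you can submit just the first subtree and use its memory and runtime as a yardstick for the full 20-genome job.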


brettChapman commented 4 years ago

Hi Glenn

Thanks for the feedback. This should significantly reduce the computational requirements. Once we're up and running on the cluster, I'll try what you've suggested and see how it goes.

Cheers

Brett