SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

Output gene sequences to run gene alignment separately #71

Closed boasvdp closed 3 years ago

boasvdp commented 3 years ago

First of all, many thanks for this tool; I have really enjoyed using it!

When analysing larger collections of genomes (e.g. >2500 E. coli), I run into issues with the job scheduler on my university's HPC cluster. There is a 5-day limit on job run time and, in some cases, PIRATE cannot align all core genes within 5 days on our system.

Is there a way for PIRATE to write the gene sequences to file, but not run the MAFFT alignment itself? The MAFFT alignments could then be submitted to the scheduler as separate jobs, which would help stay under the 5-day limit. I have tried to figure out how this would work, but I'm no Perl expert. I think align_feature_sequences.pl would basically need to stop around line 390, if that is possible.

Would this be an option to add? Or do you have other ideas about how to handle larger datasets? Many thanks in advance.

SionBayliss commented 3 years ago

Hi,

Apologies that I missed this issue, it has been a very busy few months for me. Is this still something that you would like? I would be happy to modify the code for you.

All the best, Sion

boasvdp commented 3 years ago

Hi, no worries! This would definitely still be useful. I'd be very grateful if you could implement this!

SionBayliss commented 3 years ago

I have just pushed a commit to master with an additional option in align_feature_sequences.pl. You can run align_feature_sequences.pl after the PIRATE run has completed. If you need to chunk your data into smaller jobs, simply subset the PIRATE.gene_families.tsv file into separate files; the script will only process/align the genes in the input file (provided with -i). You can switch off alignment using the --align-off (-a) switch.

For example:

PIRATE/scripts/align_feature_sequences.pl -i ./PIRATE.gene_families.tsv -g ./modified_gffs/ -o ./feature_sequences/ -p number_of_threads -d highest_gene_copy_number_to_include(e.g. 1.25) --align-off
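
The subsetting step mentioned above could be sketched roughly as follows in Python. This is only an illustration, not part of PIRATE: it assumes the first line of PIRATE.gene_families.tsv is a header row that every chunk should repeat, and the function name, chunk size and output file naming are hypothetical.

```python
from pathlib import Path


def split_gene_families(tsv_path, out_dir, families_per_chunk=500):
    """Split a tab-separated table into chunk files that each repeat the header.

    Assumption: the first line of the input is a header row and every
    following non-empty line describes one gene family.
    """
    tsv_path = Path(tsv_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with tsv_path.open() as handle:
        header = handle.readline()
        rows = [line for line in handle if line.strip()]

    chunk_paths = []
    for start in range(0, len(rows), families_per_chunk):
        # Hypothetical naming scheme: <input stem>.chunk0001.tsv, .chunk0002.tsv, ...
        chunk_name = f"{tsv_path.stem}.chunk{len(chunk_paths) + 1:04d}.tsv"
        chunk_path = out_dir / chunk_name
        with chunk_path.open("w") as out:
            out.write(header)
            out.writelines(rows[start:start + families_per_chunk])
        chunk_paths.append(chunk_path)
    return chunk_paths


# Example call (paths are placeholders):
# split_gene_families("./PIRATE.gene_families.tsv", "./gene_family_chunks")
```

Each resulting chunk file could then be passed with -i to align_feature_sequences.pl in its own scheduler job.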

I hope that helps, Sion

boasvdp commented 3 years ago

Great, thanks! I will test it on my data soon and let you know how it works for me!