Closed: boasvdp closed this issue 3 years ago
Hi,
Apologies that I missed this issue, it has been a very busy few months for me. Is this still something that you would like? I would be happy to modify the code for you.
All the best, Sion
Hi, no worries! This would definitely still be useful. I'd be very grateful if you could implement this!
I have just pushed a commit to master with an additional option in align_feature_sequences.pl. You can run align_feature_sequences.pl after the PIRATE run has completed. If you need to chunk your data into smaller jobs, simply subset the PIRATE.gene_families.tsv file into separate files; the script will only process/align the genes in the input file (provided with -i). You can switch off alignment using the --align-off (-a) switch.
For example:
PIRATE/scripts/align_feature_sequences.pl -i ./PIRATE.gene_families.tsv -g ./modified_gffs/ -o ./feature_sequences/ -p number_of_threads -d highest_gene_copy_number_to_include (e.g. 1.25) --align-off
I hope that helps, Sion
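The subsetting step described above could be sketched in Python as follows. This is only an illustration, not part of PIRATE: the function name split_gene_families and the chunk file naming are hypothetical, and it assumes that PIRATE.gene_families.tsv has a single header line followed by one row per gene family, with the header needing to be repeated in every subset file.

```python
from pathlib import Path


def split_gene_families(tsv_path, out_dir, n_chunks):
    """Split a table into at most n_chunks files, repeating the header in each.

    Hypothetical helper: assumes the first line of the input is a header
    and every following line describes one gene family.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")

    tsv_path = Path(tsv_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with tsv_path.open() as handle:
        lines = handle.readlines()
    if not lines:
        raise ValueError(f"{tsv_path} is empty")
    header, rows = lines[0], lines[1:]

    # Ceiling division so that all rows fit into n_chunks files.
    chunk_size = max(1, -(-len(rows) // n_chunks))

    chunk_paths = []
    for index, start in enumerate(range(0, len(rows), chunk_size), start=1):
        chunk_path = out_dir / f"{tsv_path.stem}.chunk{index}.tsv"
        with chunk_path.open("w") as out:
            out.write(header)
            out.writelines(rows[start:start + chunk_size])
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Each returned path could then be passed to align_feature_sequences.pl with -i in its own scheduler job, for example after calling split_gene_families("PIRATE.gene_families.tsv", "chunks", 10).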
Great, thanks! I will test it on my data soon and will let you know how it works for me!
First of all, a big thank you for this tool; I have been using it with a lot of pleasure!
When analysing larger collections of genomes (e.g. >2500 E. coli), I run into issues with the job scheduler my university's HPC cluster uses. There is a 5-day limit on job running time, and in some cases PIRATE cannot align all core genes within 5 days on our system.
Is there a way for PIRATE to write the gene sequences to file, but not run the MAFFT alignment itself? The MAFFT alignment can then be sent to the scheduler in separate jobs, which will help stay under the 5-day limit. I have tried to figure out how this would work, but I'm no perl expert. I think basically align_feature_sequences.pl would stop around line 390, if possible. Would this be an option to add? Or do you have other ideas about how to handle larger datasets? Many thanks in advance.
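The idea of sending the MAFFT alignments to the scheduler as separate jobs could look roughly like the Python sketch below. It only builds one mafft command line per sequence file and leaves submission to whatever scheduler is in use. The function name mafft_commands, the "*.fasta" pattern and the ".aln.fasta" output suffix are assumptions for illustration; the actual names of the files written by PIRATE may differ.

```python
import shlex
from pathlib import Path


def mafft_commands(fasta_dir, aligned_dir, threads=1, pattern="*.fasta"):
    """Return one mafft shell command per sequence file in fasta_dir.

    Hypothetical helper: the file pattern and output naming are assumptions,
    not something PIRATE prescribes.
    """
    fasta_dir = Path(fasta_dir)
    aligned_dir = Path(aligned_dir)

    commands = []
    for fasta in sorted(fasta_dir.glob(pattern)):
        target = aligned_dir / f"{fasta.stem}.aln.fasta"
        # mafft writes the alignment to stdout, so redirect it to a file.
        commands.append(
            f"mafft --auto --thread {int(threads)} "
            f"{shlex.quote(str(fasta))} > {shlex.quote(str(target))}"
        )
    return commands
```

Each returned string could be written into its own job script, or grouped into batches, so that no single job comes close to the 5-day limit.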