Running on large dataset

NicolasNaepflin commented 1 year ago

Hi Sion, Thanks for developing this tool! I have been using it for a while now on smaller datasets (< 1000 genomes) without issues and it has been very useful.

Recently I was looking into processing larger (~ 10000 genomes) and potentially also more diverse datasets.

Do you have any input on, or experience with, processing large datasets? (e.g. are there other options to improve the run time apart from increasing the number of threads/cores and using DIAMOND instead of BLAST?)

Additionally, for more diverse genomes such as the Prochlorococcus example in your original publication, you used an MCL inflation value of 6. As far as I know, larger inflation parameters tend to produce a more fine-grained clustering. Was there any benchmarking (or other tests) performed to choose this inflation value?

Thank you in advance

Nicolas

SionBayliss commented 1 year ago

Hi Nicolas,

Thanks for using it!

PIRATE has been successfully used on very large datasets (>25,000 genomes). I would suggest that you:

1/ Check the genomes for quality. One poor-quality genome can have a detrimental effect on the clustering and especially the paralog identification and classification.
2/ Start with a much smaller subset of your most diverse samples so that you can pick a range of thresholds (--steps) that accurately captures the diversity in your collection. You could also experiment with inflation values here to ensure sensible clusters are produced. I am afraid I don't have any tips for selecting an MCL inflation value for you :(
3/ Don't run it with gene alignment; it will take ages to finish, and it can be run separately or on genes of interest afterwards.
4/ You can also run it with paralog detection off (--para-off) on the initial run as this can take a long time to complete. It can then be rerun with paralog detection only, using the --pan-off option, once it has finished clustering at least once. You WILL need to keep intermediate files on each run for this to work (-z 2). I would test the workflow on a smaller subset so that you don't put the wrong options in on your full set and remove intermediate files or have to reprocess everything :)
5/ Throw as many cores as you can at it.
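
As a rough sketch of points 3 to 5 (not an official PIRATE wrapper), the two-stage run could be assembled like this. The helper `pirate_command` and the paths are hypothetical. Only `--steps`, `--para-off`, `--pan-off` and `-z 2` come from the advice above; `-i`, `-o` and `-t` are assumed to be the input, output and thread options, and reusing the same output directory for the second stage is also an assumption, so check everything against `PIRATE --help` before a real run. The code only builds and prints the commands; it does not execute them.

```python
import shlex
from typing import List, Sequence


def pirate_command(
    gff_dir: str,
    out_dir: str,
    threads: int,
    steps: Sequence[int],
    stage: str,
) -> List[str]:
    """Build (but do not run) the argument list for one stage of a two-stage run.

    stage="cluster"  -> clustering with paralog detection off (--para-off)
    stage="paralogs" -> paralog detection only, skipping clustering (--pan-off)
    """
    if stage not in ("cluster", "paralogs"):
        raise ValueError("stage must be 'cluster' or 'paralogs'")

    cmd = [
        "PIRATE",
        "-i", gff_dir,        # assumed: directory of input GFF files
        "-o", out_dir,        # assumed: output directory, reused by stage 2
        "-t", str(threads),   # assumed: number of threads
        "--steps", ",".join(str(s) for s in steps),
        "-z", "2",            # keep intermediate files on every run
    ]
    # No gene alignment option is added, following point 3.
    cmd.append("--para-off" if stage == "cluster" else "--pan-off")
    return cmd


if __name__ == "__main__":
    # Hypothetical paths and thresholds; try them on a small subset first.
    for stage in ("cluster", "paralogs"):
        args = pirate_command("subset_gffs", "pirate_out", 32, [50, 70, 90, 95], stage)
        print(shlex.join(args))
```

Printing the commands first makes it easier to check the options on a small subset before committing to the full set.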

I hope that helps, S

NicolasNaepflin commented 1 year ago

Hi Sion

Thanks for the quick input! I will let you know how it works for me.

SionBayliss / PIRATE

Running on large dataset #85