davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

orthofinder computation time on very big data #778

Open alambard opened 1 year ago

alambard commented 1 year ago

Hello, I have launched an OrthoFinder analysis on ~150 GB of transcriptomic data from the NCBI SRA database. It has now been running for almost 30 days. I would like to hear from anyone who has already used the pipeline on big data, to get an idea of how much time is needed for the algorithm to complete. At the moment it is still at the all-vs-all comparison (BLAST) step. I have seen in recent issues that using the MMseqs2 algorithm is faster. Here is the command I used:

orthofinder -t 15 -a 15 -S mmseqs -d -f gambierdiscus/
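
(For context, my reading of these options from OrthoFinder's help text: -t sets the number of parallel sequence-search threads, -a the number of parallel analysis threads, -S the sequence search program, -d indicates the input is DNA sequences, and -f points at the directory of input FASTA files.)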

The server has 80 GB of memory and 16 CPUs.

davidemms commented 1 year ago

Hi

How many input fasta files have you provided and how many sequences are there in total?

It sounds like a very large analysis. I would suggest doing a test where you reduce your input data to about 8x smaller. Because the all-vs-all comparison stage scales roughly quadratically with the total number of sequences, this test should run using approximately 64x less RAM and 64x less runtime than your full analysis. Compare these numbers to the computational resources you have available; this should give a guide as to whether your full analysis is achievable.
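
As a minimal sketch of one way to produce such a reduced test set (assuming uncompressed FASTA input; the directory names and the keep-every-8th strategy are illustrative choices, not something OrthoFinder provides):

```python
import glob
import os

KEEP_EVERY = 8  # keep 1 sequence in 8, for an ~8x smaller test set

os.makedirs("gambierdiscus_small", exist_ok=True)  # hypothetical output dir

for path in glob.glob("gambierdiscus/*.fa*"):  # .fa / .fasta files
    out_path = os.path.join("gambierdiscus_small", os.path.basename(path))
    with open(path) as fin, open(out_path, "w") as fout:
        record_index = -1
        keep = False
        for line in fin:
            if line.startswith(">"):  # a header line starts a new record
                record_index += 1
                keep = (record_index % KEEP_EVERY == 0)
            if keep:  # copy the header and its sequence lines
                fout.write(line)
```

Taking every 8th record, rather than the first eighth of each file, keeps the subsample spread evenly across each input file.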

All the best
David

alambard commented 1 year ago

Hello

I have 8 FASTA files, ranging from 1 GB for the smallest to 94 GB for the biggest, ~220 GB in total (more than the 150 GB I mentioned, in fact).

Here is my total number of sequences: 1,472,984,533.
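
As a rough sense of scale (assuming the all-vs-all comparison step dominates, per the reasoning above): with N ≈ 1.47 × 10^9 sequences, the search performs on the order of N^2 ≈ 2.2 × 10^18 pairwise comparisons, and an 8x smaller input gives (N/8)^2 = N^2/64, which is where the 64x estimate comes from.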

I'm going to test by reducing the files to 8x smaller, I guess.