fwhelan / coinfinder

A tool for the identification of coincident (associating and dissociating) genes in pangenomes.
GNU General Public License v3.0

Does not progress past the 'calculating lineage dependence' step. #75

Closed VishnuRaghuram94 closed 4 months ago

VishnuRaghuram94 commented 5 months ago

Hello,

Thank you for developing coinfinder. According to your publication, about 4 million pairwise tests took 7 minutes with 20 CPU cores. I am running a dataset with 547 genomes and 4872 genes, which is 11,870,628 pairwise tests. I understand it would take longer, but I am running it on an HPC with 128 CPU cores and the job gets stuck at the 'Calculating lineage dependence' stage for more than 24 hours (and eventually times out due to the HPC limits).

Reading arguments...
> CORRECTION ······· = BONFERRONI
> METHOD ··········· = COINCIDENCE
> ALT_HYPOTHESIS ··· = GREATER
> MAX_MODE ········· = ACCOMPANY
> SET_MODE ········· = FULL
> PERMIT_FILTER ···· = NO
> VERBOSE ·········· = NO
> OUTPUT_ALL ······· = NO
> FRACTION CUTOFF ·· = N/A
> SIGNIFICANCE_LEVEL = 0.05
> COMBINED_FILE ···· = gene_presence_absence_int-withquotes.csv
> GENE_NAME ······· = Genes
> GENOME_NAME ········ = Genomes
Formating Roary output for input into coinfinder...
Reading gene-genome edges...
- n.GENES = 4872
- n.GENOMES  = 547
- n.EDGES = 1008669
Dropping saturated sets...
Nothing dropped due to node saturation, your data is good to go. :)
Dropping rare elements in collection...
Nothing dropped due to rare elements in collection, your data is good to go. :)
Dropping empty sets...
Nothing dropped, your data is good to go.
Iterating matrix...
Bonferroni significance correction, given 11870628 tests, the significance level has been reduced from 0.05 to 4.21208e-09.
Running analyses...
Calculating lineage dependence...
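
For reference, the two figures in the log above can be reproduced with a few lines of arithmetic. This is only a sketch: the formula n(n+1)/2 is inferred from the numbers in this log (the more usual n(n-1)/2 would give 11,865,756 for 4872 genes), not taken from the coinfinder source, and the function names are made up for illustration.

```python
def n_pairwise_tests(n_genes: int) -> int:
    """Number of tests implied by the log: n * (n + 1) / 2.

    Assumption: inferred from the reported count (11870628 for 4872 genes),
    not from the coinfinder source code.
    """
    return n_genes * (n_genes + 1) // 2


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Bonferroni correction: divide the significance level by the test count."""
    return alpha / n_tests


n_tests = n_pairwise_tests(4872)
threshold = bonferroni_threshold(0.05, n_tests)

print(n_tests)              # 11870628
print(f"{threshold:.5e}")   # 4.21208e-09
```

Both values agree with the "Bonferroni significance correction" line in the log, so the test count itself looks as expected for this input.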

Is this expected behaviour? Running it on a smaller dataset (for example the same presence absence table but only the first 500 genes) works just fine. Do you have any tips to speed up the process? Any advice would be appreciated.

Thank you, Vishnu

fwhelan commented 4 months ago

Hi Vishnu,

This step is unfortunately the slowest. It does run in parallel, so please make sure you set the -x num_cores flag to what is available to you.

The computing time to calculate lineage independence depends on the number of genes involved in co-occurrence/avoidance relationships (as opposed to the number of pairwise tests, etc.). I've routinely run coinfinder with ~20,000 genes as input (though not all of them will necessarily be involved in a coincident relationship and therefore have their lineage independence tested) using 10 cores, and those runs finished in ~24-48 hrs, so I am a bit surprised.

You can get a sense of how far through coinfinder managed to get before your HPC kicked out the job by comparing the number of elements in the _nodes_in.csv file vs. the number of lines in your _nodes.tsv output. For example:

head coinfinder_nodes_in.csv | grep -o "," | wc -l
wc -l coinfinder_nodes.tsv 

This will help you determine if the run is close to your full geneset (i.e. just needed a few more hours) or if something is going wrong at this step...
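
The same comparison can be scripted if you want a single progress figure. The sketch below is not part of coinfinder. It assumes that _nodes_in.csv holds comma-separated gene names and that _nodes.tsv has one header line followed by one line per processed gene. The function names and the file names in the usage comment are illustrative.

```python
from pathlib import Path


def count_csv_elements(path):
    """Count non-empty comma-separated fields across the whole file.

    Assumption: the file is a plain comma-separated list of gene names.
    """
    text = Path(path).read_text()
    return sum(
        1
        for line in text.splitlines()
        for field in line.split(",")
        if field.strip()
    )


def count_data_lines(path, header_lines=1):
    """Count non-blank lines, excluding the assumed header line(s)."""
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    return max(len(lines) - header_lines, 0)


def lineage_progress(nodes_in, nodes_out):
    """Return (genes written so far, genes expected)."""
    return count_data_lines(nodes_out), count_csv_elements(nodes_in)


# Example usage (file names depend on your output prefix):
# done, total = lineage_progress("coincident_nodes_in.csv", "coincident_nodes.tsv")
# print(f"{done} of {total} genes processed")
```

A result of 0 written genes after many hours suggests the step never started producing output, rather than simply running slowly.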

Let me know and I can help you find a workaround if needed.

VishnuRaghuram94 commented 4 months ago

Thank you for your response. Yes, I am setting the -x flag.

coincident_nodes_in.csv has 3732 elements and coincident_nodes.tsv has only one line which appears to be the header. Is this expected?

As for the other files that were created, coincident-input-edges.csv has 4872 distinct elements in column 1 and 547 distinct elements in column 2, which match my numbers of genes and genomes respectively. coincident_pairs.tsv and coincident_edges.tsv have 150769 lines showing gene co-occurrence pairs.

Thanks, Vishnu

fwhelan commented 4 months ago

Hi Vishnu,

No, that's not expected behaviour. When you run your smaller test set, is coincident_nodes.tsv being populated?

VishnuRaghuram94 commented 4 months ago

Yes, it is populated in my smaller test set. The number of elements in coincident_nodes_in.csv matches the number of lines in coincident_nodes.tsv (minus the header).

fwhelan commented 4 months ago

Something's up with the larger dataset then. It could be something specific to the HPC (you could try running it locally for long enough to see whether the first few lines of _nodes.tsv are populated). Or it could be something specific to the first gene name in the larger vs. smaller dataset that's causing an issue.

VishnuRaghuram94 commented 4 months ago

I doubt it is the first gene name as the smaller set is just the first few lines of the larger set. It could be something specific to the HPC, I will try running it on a different server and also locally and report back.

Thanks for your help!

VishnuRaghuram94 commented 4 months ago

Running it locally on the full dataset populates the _nodes.tsv file. You may be right in that it has to do with the HPC.

Do you have ideas for a workaround where coinfinder simply outputs the gene co-occurrence and not the lineage dependence or the visualizations? In my case coincident_pairs.tsv is populated, does that mean the co-occurrence calculations are complete?

fwhelan commented 4 months ago

Yep, exactly, the _pairs.tsv file has all the co-occurrence information. You'll just be missing the information about lineage dependence, which is very biologically important but could be calculated separately for pairs of interest (for example), depending on what your research question is.
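
As a starting point for working with pairs of interest, here is a rough sketch for pulling a subset out of the _pairs.tsv file. It assumes the file is tab-separated with a header line and that the first two columns hold the two gene names of each pair. Please check this against the header of your own file. The function name and the gene names in the usage comment are hypothetical.

```python
import csv


def select_pairs(pairs_path, genes_of_interest, out_path):
    """Copy rows whose first or second column is a gene of interest.

    Assumption: tab-separated input with one header line and the two
    gene names in the first two columns. Returns the number of rows kept.
    """
    wanted = set(genes_of_interest)
    kept = 0
    with open(pairs_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)  # keep the original column names
        for row in reader:
            if len(row) >= 2 and (row[0] in wanted or row[1] in wanted):
                writer.writerow(row)
                kept += 1
    return kept


# Example usage (gene names are placeholders):
# n = select_pairs("coincident_pairs.tsv", {"geneA", "geneB"}, "pairs_of_interest.tsv")
```

This only narrows down the co-occurrence results. It does not compute lineage dependence, which would still need to be done separately for the selected pairs.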

VishnuRaghuram94 commented 4 months ago

Thanks for your help! Feel free to close the issue if you wish.