Warning in cluster_network()

xiaoyezao commented 2 years ago

Dear developers,

I encountered this warning message in the cluster_network() run:

Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  restarting interrupted promise evaluation

Is this a serious problem? Can I go on with the results?

Thank you,

Tao

almeidasilvaf commented 2 years ago

Hi, @xiaoyezao

I've never seen this warning before, and I can't reproduce it. Could you show the code you ran to generate this warning?

What does the output of cluster_network() look like for your data?

This may sound silly, but have you tried restarting your R session and running cluster_network() again? It seems like a warning that happens randomly. Likewise, do you get the same message if you rerun cluster_network() in the same session?

xiaoyezao commented 2 years ago

I just followed the instructions; the code is simply clusters <- cluster_network(net). It finished smoothly apart from the warning message, and the result looks like this:

> head(clusters)
                                   Gene Cluster
1 Apiaceae_Apium_graveolens_Ag1G01158.1       1
2 Apiaceae_Apium_graveolens_Ag1G01159.1       2
3 Apiaceae_Apium_graveolens_Ag1G01160.1       3
4 Apiaceae_Apium_graveolens_Ag1G01165.1       4
5 Apiaceae_Apium_graveolens_Ag1G01168.1       5
6 Apiaceae_Apium_graveolens_Ag1G01179.1       6

I then ran profiles <- phylogenomic_profile(clusters) on this result, but got the following error:

Error in stats::hclust(dist_mat, method = "ward.D") : 
  size cannot be NA nor exceed 65536

Any suggestions on this?

almeidasilvaf commented 2 years ago

Could you share your net and clusters objects so I can try to inspect this issue?

You can save them as an .rda file and push the file to a repo that I can access. Something like this:

save(clusters, net, file = "network_and_clusters.rda", compress = "xz")

xiaoyezao commented 2 years ago

Please use this link to download the data: https://drive.google.com/file/d/1Eajys70brfYKHw68mtt1O6YDGpaUqxRL/view?usp=sharing

Let me know if this doesn't work. Thank you!

almeidasilvaf commented 2 years ago

Hi, @xiaoyezao

I've just checked it now and there are some issues with your data:

You need to follow the pipeline from the beginning. If you run process_input(), it will process your annotation and seq objects to add a unique species identifier in front of gene IDs and chromosome names. This identifier is a string of 3 to 5 characters. The gene IDs in your data have long identifiers (e.g., Apiaceae_Apium_graveolens), which means you tried to process the data yourself; don't do that, or it will simply not work.
Given that you didn't follow the pipeline from the beginning, I would not trust the results you obtained, including your synteny network and your clusters. For example, your network was clustered into 73244 different clusters. That's a lot! Upon inspection, I found that most of them (43302, ~60%) have only 2 genes. You can check this with:

library(tidyverse)

# Get number of clusters
clusters %>% count(Cluster) %>% nrow()

# Get number of clusters with 2 nodes only
clusters %>% count(Cluster) %>% filter(n == 2) %>% nrow()

Although this can be a real property of your data set (e.g., if you have 2 species that are very distantly related to all other species), I'd say it is likely a problem resulting from the fact that you didn't do the processing with process_input().

After running the whole pipeline properly, if you still find this many 2-node clusters, I'd suggest filtering your clusters prior to phylogenomic profiling like this:

clusters_to_keep <- clusters %>% count(Cluster) %>% filter(n > 2) %>% select(Cluster)
fclusters <- clusters[clusters$Cluster %in% clusters_to_keep$Cluster, ]
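
For a self-contained illustration, the same filtering step can be sketched in base R on a toy data frame. The toy `clusters` object and the `hclust_limit` variable are made up for this example; the 65536 figure is taken from the stats::hclust() error message quoted earlier in this thread. The sketch assumes (not verified against the syntenet source) that phylogenomic_profile() passes one row per cluster to hclust(), so the number of clusters kept would need to stay within that limit.

```r
# Toy stand-in for the output of cluster_network(): one row per gene
clusters <- data.frame(
    Gene = paste0("gene", 1:9),
    Cluster = c(1, 1, 2, 2, 2, 3, 3, 4, 4)
)

# Number of genes in each cluster
cluster_sizes <- table(clusters$Cluster)

# Keep only clusters with more than 2 genes
keep <- names(cluster_sizes)[cluster_sizes > 2]
fclusters <- clusters[clusters$Cluster %in% keep, ]

# Size limit from the hclust() error message (hypothetical variable name)
hclust_limit <- 65536

# Assumption: one row per cluster is passed to hclust()
n_clusters <- length(unique(fclusters$Cluster))
n_clusters <= hclust_limit
```

If the last line returns FALSE on real data, a stricter size threshold in the filter may be needed before profiling.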

xiaoyezao commented 2 years ago

Thank you for your debugging!

I obtained the network using the shell version SynNet https://github.com/zhaotao1987/SynNet-Pipeline/wiki/SynNet-Build, and then fed the result to cluster_network() in R.

The large number of 2-node clusters could be real, because I have a few genomes from different plant orders that are quite distantly related. If I want to remove these few distantly related genomes, can I just remove the related clusters from the network? Or do I have to rerun everything from the very beginning?

BTW, I used long gene names because these data are also used in my other phylogenomic analyses, and I want to keep the "taxonomic information" of the genes. For me, process_input() is quite strict on the gene names, so I prepared the sequence and annotation data using a custom script following the rules of process_input(), except that the gene names are processed differently.

I will filter out the small clusters and see how it goes.

Thanks

almeidasilvaf commented 2 years ago

My pleasure to help!

Regarding your points:

You can just remove these genomes from the network. That would be faster, indeed.
You can let process_input() create the species identifiers automatically for you, and then create a data frame containing the mapping between identifiers and taxonomic information. Again, if you don't run the pipeline from the beginning, you will probably have problems with the downstream analyses in syntenet. You might want to look at this software demo of syntenet that I will present at EuroBioc2022 next month. In the slide presentation, I demonstrate how to keep taxonomic information for each gene.
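
The two points above can be sketched in base R on toy data. Everything in this block is hypothetical: the short identifiers (`Agr`, `Dca`, `Osa`), the `Anchor1`/`Anchor2` column names, the `species_info` table, and the `get_species()` helper are made up for illustration. The sketch also assumes that each gene ID starts with the species identifier followed by an underscore.

```r
# Toy edge list: each row is a pair of syntenic genes (hypothetical column names)
net <- data.frame(
    Anchor1 = c("Agr_g1", "Agr_g2", "Dca_g1", "Osa_g1"),
    Anchor2 = c("Dca_g1", "Osa_g1", "Osa_g2", "Osa_g3")
)

# Hypothetical mapping between short identifiers and taxonomic information
species_info <- data.frame(
    ID = c("Agr", "Dca", "Osa"),
    Family = c("Apiaceae", "Apiaceae", "Poaceae"),
    Species = c("Apium graveolens", "Daucus carota", "Oryza sativa")
)

# Extract the species identifier (the text before the first underscore)
get_species <- function(gene_ids) sub("_.*$", "", gene_ids)

# Point 1: drop every edge that involves a genome to be removed
to_remove <- c("Osa")
keep_edge <- !(get_species(net$Anchor1) %in% to_remove) &
    !(get_species(net$Anchor2) %in% to_remove)
fnet <- net[keep_edge, ]

# Point 2: recover taxonomic information for each gene from the mapping table
genes <- unique(c(fnet$Anchor1, fnet$Anchor2))
gene_taxonomy <- data.frame(Gene = genes, ID = get_species(genes))
gene_taxonomy <- merge(gene_taxonomy, species_info, by = "ID")
```

Keeping the taxonomy in a separate table means the gene IDs themselves can stay short, while family or order names can still be joined back onto any result that lists genes.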

I will close this issue. If you have any issues after running the complete pipeline (starting from the beginning), feel free to open a new issue here.

Thank you for using syntenet! ;)

almeidasilvaf / syntenet

Warning in cluster_network() #7