Closed xiaoyezao closed 2 years ago
Hi, @xiaoyezao
I've never seen this warning before, and I can't reproduce it. Could you show the code you ran to generate this warning?
What does the output of cluster_network() look like for your data?
This may sound silly, but have you tried restarting your R session and running cluster_network() again? It seems like a warning that happens randomly. Likewise, do you get the same message if you rerun cluster_network() in the same session?
I just followed the instructions; the code is simply clusters <- cluster_network(net). This finished smoothly apart from the warning message, and the generated result looks like this:
> head(clusters)
Gene Cluster
1 Apiaceae_Apium_graveolens_Ag1G01158.1 1
2 Apiaceae_Apium_graveolens_Ag1G01159.1 2
3 Apiaceae_Apium_graveolens_Ag1G01160.1 3
4 Apiaceae_Apium_graveolens_Ag1G01165.1 4
5 Apiaceae_Apium_graveolens_Ag1G01168.1 5
6 Apiaceae_Apium_graveolens_Ag1G01179.1 6
I continued with profiles <- phylogenomic_profile(clusters) on this result, but got the following error:
Error in stats::hclust(dist_mat, method = "ward.D") :
size cannot be NA nor exceed 65536
Any suggestions on this?
Could you share your net and clusters objects so I can try to inspect this issue?
You can save them as an .rda file and push the file to a repo that I can access. Something like this:
save(clusters, net, file = "network_and_clusters.rda", compress = "xz")
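Before uploading, it may be worth checking that the file loads back correctly. The snippet below is only a minimal sketch: the two toy data frames stand in for the real net and clusters objects (their contents and column names are made up for illustration), and the file is written to a temporary directory.

```r
# Toy stand-ins for the real `net` and `clusters` objects (hypothetical data)
net <- data.frame(
  Anchor1 = c("spA_gene1", "spA_gene2"),
  Anchor2 = c("spB_gene1", "spB_gene2")
)
clusters <- data.frame(
  Gene = c("spA_gene1", "spB_gene1", "spA_gene2", "spB_gene2"),
  Cluster = c(1, 1, 2, 2)
)

# Save both objects into one xz-compressed .rda file
rda_file <- file.path(tempdir(), "network_and_clusters.rda")
save(clusters, net, file = rda_file, compress = "xz")

# Load into a separate environment so nothing in the workspace is overwritten
loaded <- new.env()
loaded_names <- load(rda_file, envir = loaded)

# Confirm that both objects came back unchanged
print(sort(loaded_names))
print(identical(loaded$clusters, clusters))
print(identical(loaded$net, net))
```

Loading into a fresh environment keeps the check from clobbering objects that are still in use in the current session.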
Please use this link to download the data: https://drive.google.com/file/d/1Eajys70brfYKHw68mtt1O6YDGpaUqxRL/view?usp=sharing
Let me know if this doesn't work. Thank you!
Hi, @xiaoyezao
I've just checked it now and there are some issues with your data:
When you run process_input(), it processes your annotation and seq objects to add a unique species identifier in front of gene IDs and chromosome names. This identifier is a 3-5-character string. The gene IDs in your data have long identifiers (e.g., Apiaceae_Apium_graveolens), which means you tried to process the data yourself; please don't, or it will simply not work.
library(tidyverse)
# Get number of clusters
clusters %>% count(Cluster) %>% nrow()
# Get number of clusters with 2 nodes only
clusters %>% count(Cluster) %>% filter(n == 2) %>% nrow()
Although this can be a real property of your data set (e.g., if you have 2 species that are very distantly related to all other species), I'd say it is likely a problem resulting from the fact that you didn't do the processing with process_input().
After running the whole pipeline properly, if you still find such a large number of 2-node clusters, I'd suggest filtering your clusters prior to phylogenomic profiling like this:
clusters_to_keep <- clusters %>% count(Cluster) %>% filter(n > 2) %>% select(Cluster)
fclusters <- clusters[clusters$Cluster %in% clusters_to_keep$Cluster, ]
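For anyone who prefers not to load the tidyverse, the same filter can be sketched in base R on a toy table. The data below are made up; only the Gene and Cluster column names come from the output shown earlier in this thread. The final check relies on the fact that stats::hclust() refuses more than 65536 objects, and it assumes (not confirmed here) that the objects being clustered in the failing call correspond to clusters.

```r
# Toy cluster table (hypothetical data, same columns as `clusters` above)
clusters <- data.frame(
  Gene = paste0("gene", 1:9),
  Cluster = c(1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# Size of each cluster, named by cluster ID
cluster_sizes <- table(clusters$Cluster)

# Keep only clusters with more than 2 genes
ids_to_keep <- names(cluster_sizes)[cluster_sizes > 2]
fclusters <- clusters[as.character(clusters$Cluster) %in% ids_to_keep, ]

# stats::hclust() rejects inputs with more than 65536 objects, so compare
# the number of remaining clusters against that limit before profiling
# (assumption: clusters are the objects being clustered)
n_clusters <- length(unique(fclusters$Cluster))
hclust_limit <- 65536
message("Clusters left: ", n_clusters, " (hclust limit: ", hclust_limit, ")")
if (n_clusters > hclust_limit) {
  warning("Still too many clusters for hclust(); consider a stricter size filter.")
}
```

If the count is still above the limit after removing 2-node clusters, raising the size threshold is one option, at the cost of discarding more small clusters.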
Thank you for your debugging!
I obtained the network using the shell version of SynNet (https://github.com/zhaotao1987/SynNet-Pipeline/wiki/SynNet-Build), and then fed the result to cluster_network() in R.
The large number of 2-node clusters could be real because I have a few genomes from different plant orders that are quite distantly related. If I want to remove these few distantly related genomes, can I just remove the related clusters from the network? Or do I have to rerun from the very beginning?
BTW, I used long gene names because these data are also used in my other phylogenomic analyses, and I want to keep the "taxonomic information" of the genes. For me, process_input() is quite strict about gene names, so I prepared the sequence and annotation data using a custom script that follows the rules of process_input(), except that the gene names are processed differently.
I will filter out the small clusters and see how it goes.
Thanks
My pleasure to help!
Regarding your points:
Let process_input() create the species identifiers automatically for you, and then create a data frame containing the mapping between identifiers and taxonomic information. Again, if you don't run the pipeline from the beginning, you will probably have problems with the downstream analyses in syntenet. You might want to look at this software demo of syntenet that I will present at EuroBioc2022 next month. In this slide presentation, I demonstrate how one can keep taxonomic information for each gene.
I will close this issue. If you have any issues after running the complete pipeline (starting from the beginning), feel free to open a new issue here.
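On the point above about keeping taxonomic information next to short species identifiers, a mapping data frame could look roughly like the sketch below. Everything in it is hypothetical and for illustration only: the short IDs ("Agr", "Dca"), the gene IDs, the column names, and the assumption that the species ID is separated from the gene name by an underscore. It is not part of syntenet itself.

```r
# Hypothetical mapping between short species IDs and full taxonomic labels
species_map <- data.frame(
  species_id = c("Agr", "Dca"),
  taxon = c("Apiaceae_Apium_graveolens", "Apiaceae_Daucus_carota")
)

# Hypothetical gene IDs in the form <species_id>_<gene>
genes <- c("Agr_Ag1G01158.1", "Dca_DCAR_000001")

# The species ID is everything before the first underscore
# (assumed separator; adjust if your IDs are built differently)
ids <- sub("_.*$", "", genes)

# Look up the taxonomic label for each gene
gene_taxa <- data.frame(
  Gene = genes,
  taxon = species_map$taxon[match(ids, species_map$species_id)]
)
print(gene_taxa)
```

With a table like this, the short identifiers can be used throughout the pipeline and the long taxonomic labels joined back on afterwards for other analyses.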
Thank you for using syntenet! ;)
Dear developers,
I encountered this warning message in the cluster_network() run:
Is this a serious problem? Can I go on with the results?
Thank you,
Tao