HetzDra / turboGliph

R implementation of GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots), an algorithm developed by Glanville et al. to identify specificity groups in the T cell receptor repertoire based on local (motif sharing) and global (Hamming distance) similarities.

Recommendations for plotting large networks #2

Closed leeanapeters closed 1 year ago

leeanapeters commented 1 year ago

Hi, thank you for this great package!

I have a large number of sequences that I have clustered using this pipeline, but I am unable to use the interactive visualization feature through visNetwork: even on a high-performance cluster, my session freezes and does not display the plot. I have also tried to save the plot as an HTML file and ran into issues there as well.

Do you have any recommendations for this? I was hoping to use this interactive plot to pick downstream clusters of interest, but I suppose I could also do this computationally by looking at which clusters are overrepresented in my condition of interest? How would you recommend subsetting the object or calculating this?

Thanks

Leeana

HetzDra commented 1 year ago

Dear Leeana,

we are glad to hear about your positive opinion of the turboGliph package.

It's unfortunate that you are facing problems with the visualization function. To find the cause and suggest possible solutions, I need some more concrete information about your input data and your result. I would be grateful if you could answer the following questions:

  1. Which function of the package did you use to analyze your data (turbo_gliph, gliph2 or gliph_combined)?
  2. How many sequences did you put into the analysis?
  3. In your output object, how many rows does the data frame $connections have?
  4. How many clusters were identified and what is the size of the largest clusters?
  5. At what step does the execution of the program freeze (what was printed on the command line by the plot_network function up to that point)?

Independent of the visualization, the output of the gliph functions provides you with all information about the clusters. The size, members, scores and any other information of each cluster are summarized in the data frame $cluster_properties. The list $cluster_list contains all sequences for each cluster with all additional information of the input data frame. The package has recently been updated to version 0.99.2, which now includes a detailed vignette describing the functions and, in particular, their output. Hopefully this gives you all the information you need to select the clusters you are interested in. How far you can select interesting clusters computationally with this information depends on your question and your data. With the information so far I can't give you a detailed recommendation for the analysis.
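As a rough illustration of selecting clusters computationally, the sketch below tests each cluster for over-representation of one condition with a one-sided Fisher's exact test. It is a minimal example under stated assumptions, not a recommended analysis: the toy `cluster_list` only mimics the structure of $cluster_list described above (one data frame of member sequences per cluster), and the `condition` column and the totals in `n_total` are hypothetical and would have to come from your own input data frame.

```r
# Toy stand-in for the $cluster_list element of a gliph output: a named list
# with one data frame per cluster. The `condition` column is hypothetical and
# would have to be part of your own input data frame.
cluster_list <- list(
  cluster_A = data.frame(
    CDR3b = c("CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSLGQSYEQYF"),
    condition = c("case", "case", "case")
  ),
  cluster_B = data.frame(
    CDR3b = c("CASSPDRGNTEAFF", "CASSPDRGSTEAFF"),
    condition = c("case", "control")
  )
)

# Total number of input sequences per condition (toy numbers)
n_total <- c(case = 50, control = 50)

# One 2x2 table per cluster: rows = inside / outside the cluster,
# columns = case / control
enrichment <- do.call(rbind, lapply(names(cluster_list), function(cl) {
  members <- cluster_list[[cl]]
  in_case <- sum(members$condition == "case")
  in_ctrl <- sum(members$condition == "control")
  tab <- matrix(
    c(in_case, in_ctrl,
      n_total[["case"]] - in_case, n_total[["control"]] - in_ctrl),
    nrow = 2, byrow = TRUE
  )
  data.frame(
    cluster = cl,
    n_case = in_case,
    n_control = in_ctrl,
    p_value = fisher.test(tab, alternative = "greater")$p.value
  )
}))

# Correct for testing many clusters at once (Benjamini-Hochberg)
enrichment$p_adj <- p.adjust(enrichment$p_value, method = "BH")

# Clusters most enriched in the "case" condition come first
enrichment[order(enrichment$p_value), ]
```

With real data, the loop would run over the $cluster_list of the gliph output instead of the toy list. Whether a per-cluster Fisher test is appropriate depends on the study design, for example on whether sequences from the same donor can be treated as independent.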

I hope this already gives you some ideas for further processing of your data. Once I have your answers, I will look into a solution for the visualization.

I am happy to help with any further questions or feedback.

Kind regards, Jan

leeanapeters commented 1 year ago

Hi, thank you for the quick reply!

I used probably a larger number of sequences than recommended (~1E6) but we have many donors and are looking for rare events.

The function I used was gliph2, and the largest cluster size was 175. The connections data frame has 2.8E6 rows. plot_network goes all the way to the draw stage and finishes, but then I get an empty plot (which is still true if I just export to HTML), and my RStudio IDE freezes and essentially crashes if I try to use the interactive feature.

Thank you for your help and please let me know if I can provide more info

Leeana

leeanapeters commented 1 year ago

As an update,

I have now tried a range of 5000 to 50000 input sequences and was able to get the lowest end of the range working (5000 input sequences, 10-20 clusters). I am currently trying to get a run of 20K sequences with a couple hundred clusters to render and am still unable to view the output.

HetzDra commented 1 year ago

Dear Leeana,

thank you for sharing the needed information.

Considering your output sizes and your description of the problem, the cause is probably the call to the visNetwork package. In my own experience, for graphs with >3E5 connections, the interactivity of the output graph is severely limited by long loading times. You probably won't be able to display all clusters in one plot at the same time. However, it is possible to display only the clusters of particular interest in the graph. There are two options for this:

1) With this many input sequences, small clusters usually make up the majority, but they carry lower statistical significance and are therefore less informative. For this reason, plot_network provides the parameter cluster_min_size, which restricts the display to clusters of at least the specified size. This focuses the analysis on the larger clusters.

2) The display of the clusters is determined by the data frame $cluster_properties in the output of the gliph function. You can assign the output of the gliph function to a helper variable and reduce the clusters in the data frame $cluster_properties to those you need, using filter criteria of your choice. As an example, you could display local and global clusters separately in two graphs. Below is example code that displays only clusters based on local connections with at least 10 members.


library(turboGliph)
library(dplyr)

# perform analysis with input sequences
res_gliph2 <- turboGliph::gliph2(cdr3_sequences = input_sequences)

# store output in helper variable for cluster subsetting
temp_res <- res_gliph2

# filter clusters with only local connections
temp_res$cluster_properties <- temp_res$cluster_properties %>%
  filter(type == "local")

# plot network with only local clusters with at least 10 members
plot_network(clustering_output = temp_res,
             cluster_min_size = 10)
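
Before plotting, it can help to check how many clusters a given size cutoff would leave. The sketch below is a minimal illustration using a toy list that only mimics the described structure of $cluster_list (one data frame of member sequences per cluster). The toy sizes and the candidate cutoffs are made up, and the `>=` comparison is an assumption about how cluster_min_size is applied.

```r
# Toy stand-in for $cluster_list: one data frame of member sequences per
# cluster. The sizes are invented for illustration only.
toy_sizes <- c(2, 2, 3, 3, 4, 5, 8, 12, 40, 175)
cluster_list <- lapply(toy_sizes, function(n) {
  data.frame(CDR3b = paste0("SEQ", seq_len(n)))
})
names(cluster_list) <- paste0("cluster_", seq_along(cluster_list))

# Cluster size = number of member rows in each data frame
sizes <- vapply(cluster_list, nrow, integer(1))

# For each candidate cutoff, count the clusters and member sequences that
# would remain (assuming clusters with size >= cutoff are kept)
cutoffs <- c(2, 5, 10, 50)
summary_tab <- data.frame(
  cluster_min_size = cutoffs,
  n_clusters = vapply(cutoffs, function(k) sum(sizes >= k), integer(1)),
  n_sequences = vapply(cutoffs, function(k) sum(sizes[sizes >= k]), integer(1))
)
summary_tab
```

With real data, `cluster_list` would be replaced by the $cluster_list of the gliph output, and the cutoff could be raised until the remaining network is small enough to render.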

I hope this helps and that you can continue your graphical analysis with these options. Since the computational bottleneck lies in the package used to plot the network, I cannot currently offer a solution for plotting all connections at once.

Kind regards, Jan