Apply closest distance - Githubissues

Jordi-Valls commented 2 years ago

Hi Emre,

First of all, congratulations on your great work; it is really impressive. I am contacting you because I read the paper "Network-based in silico drug efficacy screening", where you calculate the proximity between nodes using different measures such as closest, shortest, kernel and centre, as well as the reference distance distribution, among others.

I am interested in calculating the closest distance and the reference distance distribution, and as far as I know I need the proximity package from this GitHub repository. My problem is knowing which outputs correspond to the closest distance and to the reference distance distribution: are they d and z, respectively? I have my drug target nodes and disease nodes, and I want to calculate their proximity in order to know whether my drug could potentially treat my disease.

Besides, how can I create the proximity.dat, proximity_similarity.dat and palliative.csv files so that I can run the proximity.ipynb notebook with my own data?

Thanks for your time

Jordi

emreg00 commented 2 years ago

Hi Jordi, thanks for your kind words. I see you have located the code for proximity in toolbox. Let me know if there is something you think I can modify in the README of this package to make this point clearer.

The proximity values can be found in the data/proximity folder (categorized by measure), and the similarity values can be found under the data/similarity folder (categorized by data type). The information on palliative drugs is available in the Excel file (Supplementary Data 1) of the article.

Hope this helps,

Emre

Jordi-Valls commented 2 years ago

Hi Emre,

Sorry, maybe I did not explain myself correctly. My question was about how I can do these analyses with my own data. I asked about z and d because I don't know whether z is the reference distance distribution and d is the "closest" result from the paper. The function calculate_proximity from wrappers returns d, z, mean and sd, but I don't know which is the closest metric and which is the reference distance distribution. It is not clear to me which results I obtain when I execute the command d, z, (mean, sd) = wrappers.calculate_proximity(network, nodes_from, nodes_to, min_bin_size = 2)

What is d, and what is z? How can I interpret the values? I want to use my network, my nodes_from and my nodes_to to calculate the closest distance and the reference distance distribution. If possible, I would also like to calculate the other metrics, such as shortest, kernel and centre, but I don't know how to do it. Besides, I don't know how you create the proximity.dat, proximity_similarity.dat and palliative.csv files. I want to create those files for my own analyses and run the proximity.ipynb notebook in order to calculate the AUROC, among other interesting things, but I need to generate those files first.

Maybe the proximity values, similarity and palliative files are generic files? Or do they have to be calculated for different conditions?

I hope this is clearer!

Thanks for your help

Jordi

Jordi-Valls commented 2 years ago

Hi Emre, I have looked into the output obtained after executing the following command: d, z, (mean, sd) = wrappers.calculate_proximity(network, nodes_from, nodes_to, min_bin_size = 2)

Is it correct that z is the reference distance distribution (defined in the paper "Network-based in silico drug efficacy screening") and d is the closest metric?

About min_bin_size, how do I know the maximum number of nodes in the node sets? If I understand correctly, I have to apply a clustering to all the nodes of my network, right? Otherwise I don't know the best way to select min_bin_size.

In addition, after executing the proximity pipeline, I obtained z = -0.40347217732097307 with my nodes, using disease nodes as nodes_from and drug targets as nodes_to. What is the threshold for deciding whether a drug can treat a disease?

Thanks, and sorry for so many questions.

emreg00 commented 2 years ago

Hi Jordi,

No worries, please find below my comments in relation to your questions:

d: the observed distance between the two sets of nodes.
z: the z-score calculated by comparing the observed distance (d) to the background distribution of distances calculated using random node sets. Depending on the problem you are trying to address, you can consider a strict cutoff such as z < -2 (the two node sets are significantly close) or a relaxed cutoff such as z < -0.5 (the two node sets are closer than expected, but not significantly so). In the analysis in the article, we opted for the latter, given that drugs tend to have low target specificity and drug targets interact with many proteins in biological networks, which are known to be small world (one can reach any other node in the network in a few steps).
min_bin_size: the minimum size (minimum number of nodes) of the bins used for matching the degrees of the nodes in the random node sets with the degrees of the nodes for which d is calculated. Typically a size of 100 is sufficient to have a representative set of nodes for random selection (for biological networks containing on the order of 10,000 nodes). You can lower this number if you have a small network or increase it if you have a very large network. You can find more information in the article, but briefly, the idea is to have a sufficient number of nodes in each bin, especially for higher degree nodes, to be able to sample "equivalent" nodes that also have high degrees.
distance measure: The wrapper script uses the "closest" measure by default. To use other measures, you would have to revert to the underlying, more primitive network_utilities function. In the calculate_proximity function in wrappers above, you would swap "calculate_closest_distance" with the "get_separation" function, to which you can provide "distance" as a parameter (possible values: "closest", "shortest", "kernel", "center", "jorg-closest", corresponding to the 5 measures mentioned in the article).
nodes_to vs nodes_from: It is good practice to use drug targets as "nodes_from" and disease genes as "nodes_to", provided that the number of drug targets is smaller than the number of disease genes (this was the case for the complex diseases we were working with). Note that the measures are not symmetric and the results would be different if you swapped the node sets.
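
The points above (d, z, min_bin_size and the asymmetry of the measure) can be illustrated with a minimal, standard-library-only sketch. This is not the toolbox code: the function names bfs_lengths, closest_distance, degree_bins, sample_matched and proximity are hypothetical, the binning is a simplified take on degree matching, and setting z to 0 when the standard deviation is 0 is an assumption made for the sketch.

```python
import random
from collections import deque
from statistics import mean, pstdev


def bfs_lengths(adj, source):
    """Unweighted shortest path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closest_distance(adj, nodes_from, nodes_to):
    """Mean, over nodes_from, of the distance to the nearest node in nodes_to.

    Assumes a connected graph, so every node in nodes_to is reachable.
    """
    total = 0.0
    for u in nodes_from:
        lengths = bfs_lengths(adj, u)
        total += min(lengths[v] for v in nodes_to)
    return total / len(nodes_from)


def degree_bins(adj, min_bin_size):
    """Merge nodes of consecutive degrees until a bin has >= min_bin_size nodes."""
    by_degree = {}
    for node, neighbors in adj.items():
        by_degree.setdefault(len(neighbors), []).append(node)
    bins, current = [], []
    for degree in sorted(by_degree):
        current.extend(by_degree[degree])
        if len(current) >= min_bin_size:
            bins.append(current)
            current = []
    if current:  # leftover high-degree nodes join the previous bin, if any
        if bins:
            bins[-1].extend(current)
        else:
            bins.append(current)
    return bins


def sample_matched(bins, nodes, rng):
    """Pick one distinct random node from the degree bin of each node in nodes.

    Assumes each bin holds at least as many nodes as are drawn from it.
    """
    node_to_bin = {n: b for b in bins for n in b}
    chosen = set()
    for n in nodes:
        candidates = [c for c in node_to_bin[n] if c not in chosen]
        chosen.add(rng.choice(candidates))
    return sorted(chosen)


def proximity(adj, nodes_from, nodes_to, n_random=200, min_bin_size=2, seed=0):
    """Return d, z and (mean, sd) of the degree-matched random distances."""
    rng = random.Random(seed)
    bins = degree_bins(adj, min_bin_size)
    d = closest_distance(adj, nodes_from, nodes_to)
    null = [
        closest_distance(
            adj,
            sample_matched(bins, nodes_from, rng),
            sample_matched(bins, nodes_to, rng),
        )
        for _ in range(n_random)
    ]
    m, s = mean(null), pstdev(null)
    z = (d - m) / s if s > 0 else 0.0  # sketch choice: z = 0 when sd is 0
    return d, z, (m, s)


# Toy network: a ring of 8 nodes, so every node has degree 2.
adj = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
d_forward = closest_distance(adj, [0, 4], [1])  # (1 + 3) / 2
d_reverse = closest_distance(adj, [1], [0, 4])  # min(1, 3)
d, z, (m, s) = proximity(adj, [0, 4], [1])
```

On this toy ring, d_forward and d_reverse differ, which shows why swapping nodes_from and nodes_to changes the result. A negative z would mean the observed distance is smaller than the average over the random node sets.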

Hope this clarifies the doubts you have, let me know if you need further clarification.

Emre

Jordi-Valls commented 2 years ago

Hi Emre,

Thanks for your reply, for me now is really clear.

I hope you enjoy the start of the new year. Best,

Jordi

xianggeshamingzi commented 2 years ago

After reading your conversation, I followed your method and tried to calculate the "closest", "shortest", "kernel", "center" and "jorg-closest" values, but an error occurred while running the code. Is there anything else I should pay attention to when calculating these values? Looking forward to your reply.

emreg00 commented 2 years ago

Hi Xiang,

It would help to understand the error you are getting, can you please share the error message?

Best,

Emre

emreg00 commented 2 years ago

From this partial info, it sounds like you might be using Python 3, where methods return generators rather than lists or dictionaries, as opposed to Python 2.

Is that so?

Emre

On Sat, Mar 26, 2022 at 10:47 PM xianggeshamingzi @.***> wrote:

[image: Screenshot 2022-03-27 122009] https://user-images.githubusercontent.com/94523516/160266962-ed97f0b2-76e8-4dae-acfb-6f57e7e6cdf5.png

Thank you very much for your reply. I am sorry to bother you again; my knowledge of Python is limited. I made some changes to the calculate_proximity function in wrappers as you described, but I get an error when running it: TypeError: can't pickle generator objects. Could this be caused by a mistake in my modified calculate_proximity function?


xianggeshamingzi commented 2 years ago

Hi Emre, thank you so much for your patience. I am sure I am using Python 2. I previously managed to calculate the default "closest" measure, but I am having trouble calculating the other distance measures. I wonder whether there is an error in my version of the calculate_proximity function in wrappers, where I swapped "calculate_closest_distance" with the "get_separation" function and provided a value for the "distance" parameter. Sorry to bother you again.

emreg00 commented 2 years ago

Hi again,

The TypeError you mentioned originates from pickling the shortest path lengths dictionary, which in principle isn't related to the proximity calculation function (the dictionary can be calculated outside and given as an argument to the function). Perhaps you can provide the full error message (with the function call stack and line numbers).

In any case, I have just updated wrappers.py in the toolbox repo to add distance as a parameter of the calculate_proximity function. Let me know if this helps.
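
The idea of calculating the shortest path lengths once, outside the proximity function, and reusing them from a pickle file can be sketched as follows. This is a standalone illustration using only the standard library; load_or_compute_lengths and the toy adjacency dictionary are hypothetical and do not reproduce the toolbox implementation.

```python
import os
import pickle
import tempfile
from collections import deque


def bfs_lengths(adj, source):
    """Unweighted shortest path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def load_or_compute_lengths(adj, dump_file):
    """Load all-pairs lengths from dump_file, or compute and store them."""
    if os.path.exists(dump_file):
        with open(dump_file, "rb") as handle:
            return pickle.load(handle)
    # A plain dict of dicts is picklable; a generator would not be.
    lengths = {source: bfs_lengths(adj, source) for source in adj}
    with open(dump_file, "wb") as handle:
        pickle.dump(lengths, handle)
    return lengths


adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
with tempfile.TemporaryDirectory() as tmp:
    dump_file = os.path.join(tmp, "toy.sif.pcl")
    first = load_or_compute_lengths(adj, dump_file)   # computed and written
    second = load_or_compute_lengths(adj, dump_file)  # read back from disk
```

The resulting dictionary could then be handed to whichever function needs the lengths, so the expensive all-pairs computation runs only once per network.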

Emre

xianggeshamingzi commented 2 years ago

Hi Emre, thanks for updating wrappers in response to my question. With the updated wrappers I succeeded in calculating the result with the default distance="closest", but when I change distance="closest" to distance="shortest", "kernel" or "center", there is still an error. It occurs at lengths = sp[geneid] on line 473 of network_utilities: TypeError: 'NoneType' object has no attribute '__getitem__'. I then tried adding lengths = network_utilities.get_shortest_path_lengths(network, "D:\test_juli\toy.sif.pcl") to the calculate_proximity function in wrappers, and the error then appeared at cPickle.dump(val, open(dump_file, 'w')) on line 138 of network_utilities: TypeError: can't pickle generator objects. Thank you so much for being so patient in helping me solve the problem.

emreg00 commented 2 years ago

Hi Xiang,

I have located the issue, which likely stems from a change in the networkx function: it returns a generator in recent versions of the package, as opposed to a dictionary in the 1.x versions.

I have patched the code to convert the iterator to a dictionary before pickling, which circumvents the issue, and pushed an updated version. Let me know if it works for you now. I have also added an example of how to run proximity with a non-default measure (such as shortest) to the toolbox package's README.
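
The kind of fix described above can be reproduced in isolation. The sketch below uses a hypothetical generator function, pairs_as_generator, as a stand-in for an API that lazily yields (source, lengths) pairs; it is not the patched toolbox code, only an illustration of why converting to a dictionary makes pickling possible.

```python
import pickle


def pairs_as_generator(lengths):
    """Stand-in for an API that lazily yields (source, {target: length}) pairs."""
    for source, targets in lengths.items():
        yield source, targets


toy = {"A": {"A": 0, "B": 1}, "B": {"A": 1, "B": 0}}

# Pickling the generator itself fails with a TypeError.
try:
    pickle.dumps(pairs_as_generator(toy))
    pickled_generator = True
except TypeError:
    pickled_generator = False

# Materializing the pairs into a dictionary first makes them picklable.
as_dict = dict(pairs_as_generator(toy))
restored = pickle.loads(pickle.dumps(as_dict))
```

Materializing the generator costs memory proportional to the number of node pairs, which is the price of being able to cache the lengths on disk.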

Best wishes,

Emre

xianggeshamingzi commented 2 years ago

Hi Emre, Thank you so much, my lack of words may not express my gratitude. All in all, I really appreciate your toolbox contribution, as well as your detailed responses and all kinds of help. All my problems have been resolved with your patient help. I wish you all the best. Xiang

emreg00 commented 2 years ago

Hi Xiang,

I am glad that it worked and that you find it useful. Thanks for your kind words, and I wish you the best with your work.

Emre.

emreg00 / proximity

Apply closest distance #2