RoyZhengGao / edge2vec

Learning node representation using edge semantics
BSD 3-Clause "New" or "Revised" License

Peer Review of arXiv preprint version 2 for BMC Bioinformatics #1

Open dhimmel opened 5 years ago

dhimmel commented 5 years ago

Greetings, I recently reviewed the manuscript for this study titled edge2vec: Representation learning using edge semantics for biomedical knowledge discovery. I authored the review for BMC Bioinformatics. Since the manuscript looked to be the same as version 2 on arXiv (1809.02269v2), I figured it was okay to post my review publicly.

I posted my review on Publons and will copy it below. Hopefully, sharing the review will allow you to address the feedback sooner and provide you with the ability to follow up in this issue with questions if anything is unclear. Furthermore, perhaps others will find my assessment of the work valuable.

dhimmel commented 5 years ago

Review of edge2vec version 2 on arXiv

Overview

This study addresses an important problem, which is quantifying each node in a hetnet using a low-dimensional set of features (embeddings). The study correctly identifies that hetnets (networks with multiple node and relationship types) are an increasingly important data structure for modeling biomedical knowledge.

According to the authors, previous approaches, such as metapath2vec, fall short in terms of being able to generally produce embeddings on a hetnet. If my comprehension of this study is correct, the proposed method, called edge2vec, can produce embeddings given just a hetnet. While certain parameters must be optimized, prior knowledge of how to weight different types of nodes or edges in the network is not required. Therefore, edge2vec has the opportunity to be widely applied as a method since it can be largely automated. Meanwhile, hetnets are poised to become one of the primary data structures for encoding knowledge across many scientific domains. As such, I consider this work to be of extreme importance.

Questions

edge2vec handles edge types by constructing an edge-type transition matrix to guide random walks. I think this is a neat solution for dealing with edge heterogeneity for the purposes of creating embeddings. Random walks seem appropriate as a versatile way to explore any area of the network. I was a bit unclear on how the edge-type transition probabilities are optimized. It read to me like the transition probabilities were updated to match what was observed on the previous round of walks. However, wouldn't this favor transitioning to more plentiful edge types? I guess I'm looking for a simplified summary of how transition probabilities are optimized.
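To make the concern concrete, here is a minimal sketch of the naive reading described above, in which transition probabilities are simply re-estimated from the edge-type transitions counted in the previous round of walks. This is an assumption about one possible update rule, not the authors' actual optimization; the function name, edge-type labels, and toy walks are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(walks):
    """Re-estimate edge-type transition probabilities from observed walks.

    Each walk is a list of edge-type labels in traversal order. This is a
    hypothetical frequency-based update, not the edge2vec procedure.
    """
    counts = defaultdict(Counter)
    for walk in walks:
        for prev_type, next_type in zip(walk, walk[1:]):
            counts[prev_type][next_type] += 1
    # Normalize each row so the outgoing probabilities sum to 1.
    return {
        prev_type: {t: n / sum(row.values()) for t, n in row.items()}
        for prev_type, row in counts.items()
    }


# Toy walks over two hypothetical edge types; "binds" is far more plentiful.
walks = [
    ["binds", "binds", "binds", "treats"],
    ["binds", "binds", "binds", "binds"],
]
probs = estimate_transitions(walks)
print(probs["binds"])
```

Under this update, the plentiful "binds" type receives most of the probability mass, and feeding the result back into the next round of walks would reinforce that imbalance, which is the behavior the question above asks about.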

The writing style and prose are good overall. The presentation of complex ideas is straightforward for the most part. One improvement would be more paragraph breaks. Presently, paragraphs are used sparingly, with whole sections consisting of one paragraph. The authors should consider whether more paragraphs would help break sections into discrete units and improve readability, especially in the introduction.

I would like additional discussion of what the node embeddings capture, even if it is partly speculation. My understanding is that they capture information on which nodes a given node is proximal to. What else could they possibly be encoding about the nodes? One thing we've found is that node degree often greatly affects the output of network methods. I would appreciate it if the authors analyzed the relationship between metaedge-specific (and optionally total) node degree and embedded features. Oftentimes node degree is overlooked in network-based methods, despite being the single most important driver of results. Therefore, any additional assessment of how degree affects edge2vec that the authors can provide will be helpful.
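One simple form the requested degree analysis could take is a per-dimension correlation between node degree and embedding values. The sketch below assumes degrees and embeddings are available as plain dictionaries keyed by node; the function names and toy values are hypothetical and are not taken from the edge2vec code or data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def degree_correlations(degrees, embeddings):
    """Correlate node degree with each embedding dimension.

    degrees: {node: degree}; embeddings: {node: [feature, ...]}.
    Both inputs are hypothetical stand-ins for real edge2vec output.
    """
    nodes = sorted(degrees)
    n_dims = len(embeddings[nodes[0]])
    degree_values = [degrees[node] for node in nodes]
    return [
        pearson(degree_values, [embeddings[node][d] for node in nodes])
        for d in range(n_dims)
    ]


# Toy data: dimension 0 tracks degree exactly, dimension 1 does not.
degrees = {"a": 1, "b": 2, "c": 3, "d": 4}
embeddings = {"a": [0.1, 0.5], "b": [0.2, -0.5], "c": [0.3, -0.5], "d": [0.4, 0.5]}
correlations = degree_correlations(degrees, embeddings)
print(correlations)
```

Passing degrees counted for a single metaedge instead of total degree would give the metaedge-specific version of the same analysis.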

Comments

Figure 2, which displays the Chem2Bio2RDF metagraph (schema), is a useful visualization. It would be more straightforward to interpret if the metanodes and metaedges were labeled directly, rather than having to cross-reference a legend to expand the labels. As an example, see Figure 1A of the Project Rephetio manuscript at https://doi.org/c3dj. I don't think the image would take any more space overall, were the legend removed and metanodes / metaedges labeled directly.

Regarding 4.3 Entity multi-classification, the authors evaluate the ability of embeddings to predict node types. However, they continue to include edge type information when generating the embeddings. This strikes me as odd, because edge types imply node types. Specifically, as long as a node has at least one edge with a known type, the type of that node can be trivially inferred. Now the authors claim:

Moreover, in all steps of edge2vec, we only use edge-type information during the random walk to generate edge-type transition metrics. We never involve node type information into our model.

However, if edge2vec is given access to edge types, whereas the other three methods are not, is this a fair comparison? In particular, since a supervised SVM model is used to predict node types from the embeddings, it seems that the supervised model could detect remnants of the transition probabilities that are encoded in the embeddings. More explanation is required as to why predicting node types makes sense when edge types are known and used as input.
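The point that edge types imply node types can be illustrated with a short sketch. It assumes a schema in which every edge type fixes the types of both endpoints; the edge-type names and node identifiers below are hypothetical placeholders rather than the actual Chem2Bio2RDF labels.

```python
# Hypothetical schema: each edge type fixes the types of its two endpoints.
EDGE_SCHEMA = {
    "compound-binds-gene": ("compound", "gene"),
    "gene-associates-disease": ("gene", "disease"),
}


def infer_node_types(edges):
    """Infer node types from typed edges given as (source, edge_type, target)."""
    node_types = {}
    for source, edge_type, target in edges:
        source_type, target_type = EDGE_SCHEMA[edge_type]
        node_types[source] = source_type
        node_types[target] = target_type
    return node_types


# Placeholder node identifiers, not real entities.
edges = [
    ("c1", "compound-binds-gene", "g1"),
    ("g1", "gene-associates-disease", "d1"),
]
node_types = infer_node_types(edges)
print(node_types)
```

No learning is involved here: a lookup over typed edges recovers every node type, which is why a classifier that sees edge-type signal has an easier task than one that does not.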

Personally, I think it may be a more powerful application to predict an attribute of a node rather than node type. For example, can embeddings be used to predict which compounds cause a certain side effect? Each observation would remain a single node, and presumably side effect (sider) nodes would be removed from the hetnet for this application. Basically, I envision that more real-world tasks come down to predicting whether a node has a certain attribute/property rather than whether it's of a certain type, since I think type is often known a priori. As I feel the manuscript already performs several applications, I do not think this analysis is required to demonstrate the utility of edge2vec, but I propose it as a future application.

In section 4.4, the authors note that the "compound-gene-compound" metapath was used for metapath2vec. This initially confused me because it is underspecified: there are two metaedges that connect genes and compounds (although I don't know what they are because it's hard to go from color to label in Figure 2). In my opinion, a metapath is defined by its sequence of metaedges, not its metanodes. Then I remembered the authors noted that "metapath2vec does not consider edge types rather only node types". Assuming this is true (I have not reviewed metapath2vec), I find metapath2vec an unfortunate nomenclature for this pre-existing method, as the method is not capable of handling the general case of metapaths. One major contribution of the present study (edge2vec) is that it overcomes several of the shortcomings of metapath2vec.
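The ambiguity can be shown by enumerating the metaedge sequences that fit one metanode sequence. The sketch assumes two metaedges between compounds and genes, as stated above, but their names ("binds" and "expresses") are invented for illustration since the real labels are not given here.

```python
from itertools import product

# Hypothetical metaedges between compounds and genes; the real Chem2Bio2RDF
# names are not reproduced here.
METAEDGES = {
    ("compound", "gene"): ["binds", "expresses"],
    ("gene", "compound"): ["binds", "expresses"],
}


def expand_metapath(metanodes):
    """List every metaedge sequence consistent with a metanode sequence."""
    steps = [METAEDGES[pair] for pair in zip(metanodes, metanodes[1:])]
    return [list(combo) for combo in product(*steps)]


paths = expand_metapath(["compound", "gene", "compound"])
print(len(paths))  # 4 distinct metaedge sequences for one metanode sequence
```

With two metaedges per step, "compound-gene-compound" corresponds to four different metapaths once edge types are taken into account.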

Perhaps I overlooked the definition of "metapath2vec++", but I couldn't find a description via ⌘F.

Would Table 4 be better expressed as ROC (and possibly also PR) curves showing the performance of each classifier? I am not sure what the purpose is of evaluating the performance of binary predictions at a few thresholds rather than the performance of continuous predictions across all thresholds. In short, I think binary predictions are inferior to continuous probability predictions and that ROC or PR curves are more information-rich than Table 4.
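As a rough sketch of the threshold-free evaluation suggested here, the following computes the points of an ROC curve and the area under it from continuous scores, using only the standard library. The labels and scores are toy values, not results from the manuscript, and the function names are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def roc_points(labels, scores):
    """(false positive rate, true positive rate) at every distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy labels and predicted probabilities, not results from the manuscript.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
area = auroc(labels, scores)
curve = roc_points(labels, scores)
print(area, curve)
```

Each row of a table like Table 4 corresponds to a single point on such a curve, whereas the full curve and its area summarize performance across every threshold at once.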

In my opinion, Figure 4 showing the 2D PCA projection of embeddings for 25 genes of 5 different gene families is a spectacularly simple visualization demonstrating that edge2vec works. Is the margin of this figure unnecessarily bulky?

Data availability

I didn't see the embeddings computed for each Chem2Bio2RDF node in the zipped data on GitHub. This dataset would be helpful for researchers who want to use the output of edge2vec without having to recompute it themselves. The data would be best preserved on a data archiving platform (such as Figshare or Zenodo) under an open license (preferably CC0).

Reproducibility

The version of each software dependency used to perform the study should be tracked somewhere.
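One lightweight way to record versions from within Python is the standard library's importlib.metadata module. The sketch below is only an illustration: the package names are guesses at likely dependencies, not a list taken from the repository, and the helper name is hypothetical.

```python
from importlib import metadata


def dependency_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package list is a guess at likely dependencies, not taken from the repository.
versions = dependency_versions(["numpy", "networkx", "gensim"])
for name, version in versions.items():
    print(f"{name}=={version}")
```

Committing the printed lines as a requirements file alongside the code would be one way to keep the record with the analysis.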

I don't believe the code to perform all analyses is available. The authors should consider posting all code needed to perform the study.

Python code

The software to run edge2vec is available at https://github.com/RoyZhengGao/edge2vec, potentially making it possible for other users to run edge2vec on their own networks. Great!

The source code README states "The code is released under GNU license". This is insufficient to license the repository because GNU produces multiple licenses, each with multiple versions. All of these licenses require that a copy of the license be included with the source code. Instead of or in addition to the README statement, a LICENSE file should be included in the repository. The authors should also consider whether a more permissive license than GPLv3 would be more appropriate. For example, the MIT or BSD 3-Clause License would be much more compatible with the existing ecosystem of Python software.

The Python modules do not include docstrings for functions. The important functions, especially those that are part of the external API, should have docstrings describing the parameters and output.
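For illustration, a docstring covering parameters and output might look like the following. The function is a hypothetical helper written for this example, not one from the edge2vec repository, and the NumPy-style layout is just one common convention.

```python
import random


def sample_walk(neighbors, start, length, rng=None):
    """Sample an unbiased random walk from an adjacency mapping.

    This helper is hypothetical and only illustrates docstring layout; it is
    not a function from the edge2vec repository.

    Parameters
    ----------
    neighbors : dict
        Maps each node to a list of adjacent nodes.
    start : hashable
        Node at which the walk begins.
    length : int
        Maximum number of nodes in the returned walk.
    rng : random.Random, optional
        Source of randomness; a fresh generator is used when omitted.

    Returns
    -------
    list
        Visited nodes in order, beginning with ``start``. The walk stops
        early if it reaches a node with no neighbors.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length and neighbors.get(walk[-1]):
        walk.append(rng.choice(neighbors[walk[-1]]))
    return walk


walk = sample_walk({"a": ["b"], "b": ["a", "c"], "c": []}, "a", 5, random.Random(0))
print(walk)
```

With a docstring in place, help(sample_walk) shows users the expected inputs and output without requiring them to read the function body.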

The Python software lacks a formal definition of dependencies and is not configured to be installable as a module. The authors should consider adding a setup.py to enable installation as a package.

I'm the author of the hetmatpy Python package (https://github.com/hetio/hetmatpy) for storing hetnets as matrices and performing certain algorithms. If the authors would like edge2vec to be integrated into a larger software ecosystem for hetnet computation, we would be happy to assist with adding the algorithm to hetmatpy. If so, opening a GitHub issue on hetmatpy would be a good place to start.

Acknowledgements

I would like to thank Ben Heil and David Nicholson, both graduate students in Genomics and Computational Biology at the University of Pennsylvania, for their discussion related to neural-network-based embedding methods.

This review was performed by Daniel Himmelstein (@dhimmel on GitHub and Twitter).

RoyZhengGao commented 5 years ago

Thanks so much for your comment! I will consider your feedback and update both the paper and the code~ Thanks for the invitation to merge this code into hetmatpy. I will take a look at the package first and see whether there is a possibility of putting edge2vec inside. Thanks!

RoyZhengGao commented 5 years ago

Thanks for your comments! We share our point-by-point response here: point by point response.docx

dhimmel commented 5 years ago

Thanks @RoyZhengGao for sharing your response to my initial review. Here is the review of the revised manuscript that I submitted to BMC Bioinformatics.


The authors provided an in-depth reply to my previous review and addressed many of my suggestions and criticisms. I think edge2vec will be a powerful method for heterogeneous network analysis, and applaud the authors on this work. I have some small remaining comments, which I expect the authors will be able to quickly address.

The edge type names in Figure 2 are not ideal. In general, edge types should be verbs, in order to signify the relationship. For example, there is an edge "gene-gene-disease", which would be more understandable as "gene-associates-disease". I am not sure whether the authors chose the edge type names, or inherited them from Chem2Bio2RDF. If the authors chose the edge type names, I think they could be improved. However, if they come from Chem2Bio2RDF, I understand the naming is an upstream issue.

I was not able to download embeddings at http://ella.ils.indiana.edu/~gao27/data_repo/edge2vec due to the error "The requested URL /~gao27/data_repo/edge2vec was not found on this server." I also don't see the benefit of hosting the dataset personally as opposed to using a free-of-charge service such as Figshare or Zenodo. Both of these services assign DOIs to datasets, include versioning, and will be more persistent and reliable than a lab server. In addition, these repositories require assigning a license metadata field. I highly recommend the CC0 public domain license for the embeddings dataset.

I have double checked the GitHub repository and confirm that a BSD 3-Clause License has been added.

RoyZhengGao commented 5 years ago

Thanks for sharing this! You can see my point-by-point response here:

  1. It would be helpful for the authors to post their written reply to my previous review on the GitHub Issue. This way readers will be aware of the responses. Response: Thanks for your question. We have attached our previous response to the opened issue in the github repository.

  2. The edge type names in Figure 2 are not ideal. In general, edge types should be verbs, in order to signify the relationship. For example, there is an edge "gene-gene-disease", which would be more understandable as "gene-associates-disease". I am not sure whether the authors chose the edge type names, or inherited them from Chem2Bio2RDF. If the authors chose the edge type names, I think they could be improved. However, if they come from Chem2Bio2RDF, I understand the naming is an upstream issue. Response: Thanks for your question. Actually, when we obtained the dataset, we had the same concern as you. However, as the dataset comes from Chem2Bio2RDF, which is already publicly available, we have to keep the node-/edge-type names the same to maintain consistency.

  3. I was not able to download embeddings at http://ella.ils.indiana.edu/~gao27/data_repo/edge2vec due to the error "The requested URL /~gao27/data_repo/edge2vec was not found on this server." I also don't see the benefit of hosting the dataset personally as opposed to using a free-of-charge service such as Figshare or Zenodo. Both of these services assign DOIs to datasets, include versioning, and will be more persistent and reliable than a lab server. In addition, these repositories require assigning a license metadata field. I highly recommend the CC0 public domain license for the embeddings dataset. Response: Thanks for your question. We forgot to use “%20” to replace the space in the URL, which is why the embedding file couldn’t be downloaded. As you suggested, we stored another copy on Figshare at https://figshare.com/articles/edge2vec_vector_zip/8097539 under a CC0 license. We also mentioned it in the GitHub repository.

  4. I have double checked the GitHub repository and confirm that a BSD 3-Clause License has been added. Response: Thanks for your comment! We have added the license based on your previous suggestions. Thanks!
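The “%20” fix mentioned in the response to point 3 is an instance of percent-encoding, which the Python standard library performs with urllib.parse.quote. The path below is a made-up example containing spaces, since the actual file location is not given in this thread.

```python
from urllib.parse import quote

# Hypothetical path containing spaces; the real file location is not shown here.
raw_path = "/data repo/edge2vec vectors.zip"
encoded_path = quote(raw_path)
print(encoded_path)  # /data%20repo/edge2vec%20vectors.zip
```

By default, quote leaves "/" untouched, so only the spaces in the path are replaced.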

dhimmel commented 5 years ago

Awesome. Looks like you've addressed all the issues in my latest review. I see the Figshare deposit is available with the following DOI: 10.6084/m9.figshare.8097539.v1.

RoyZhengGao commented 5 years ago

Yes, the dataset has been uploaded there~