benedekrozemberczki / karateclub

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)
https://karateclub.readthedocs.io
GNU General Public License v3.0

graph2vec implementation and graphs with missing nodes #47

Closed kajocina closed 3 years ago

kajocina commented 3 years ago

Hi there,

first of all, thanks a lot for developing this. It has the potential to simplify in-silico experiments on biological networks, and I am grateful for that!

I have a question about the graph2vec implementation. The package requires nodes to be labelled with consecutive integers starting from 0. I am working with a collection of 9,000 small networks and would like to embed all of them into an N-dimensional space. Each of those networks consists of about 25,000 nodes (genes, in this case), but some nodes are missing from some networks, since not every gene is expected to be present in every network.

If I rename all my nodes from actual gene names to integers, the networks that lack some genes will end up with non-consecutive node labels: one network might contain (...), 20, 21, 24, 25, (...) and another (...), 20, 21, 22, 24, 25, (...). That would violate the requirement that the labels be consecutive.

My question is: is the implementation aware that node 25 is the same object across the different networks? Or does that not matter, because the embedding takes only the structure into account, in which case I should relabel each network separately to keep the node labels consecutive?
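The second option in the question above, relabelling each network separately, can be sketched in a few lines. This is a minimal illustration in plain Python, not part of Karate Club: the function name `relabel_consecutive` and the gene names are hypothetical, and each network is assumed to be given as an edge list of gene-name pairs. For graphs already held in NetworkX, `networkx.convert_node_labels_to_integers` does a similar job.

```python
def relabel_consecutive(edges):
    """Map arbitrary node names to 0..n-1 for a single graph.

    Returns the relabelled edge list and the name -> index mapping,
    so the original gene names can be recovered later.
    """
    names = sorted({node for edge in edges for node in edge})
    index_of = {name: i for i, name in enumerate(names)}
    relabelled = [(index_of[u], index_of[v]) for u, v in edges]
    return relabelled, index_of


# Two hypothetical networks over the same gene set; GENE_C is absent from the first.
net_a = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_D")]
net_b = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_C", "GENE_D")]

edges_a, map_a = relabel_consecutive(net_a)
edges_b, map_b = relabel_consecutive(net_b)
```

After relabelling, index 2 refers to GENE_D in the first network and to GENE_C in the second, so the integer labels alone no longer carry a shared identity across networks. If gene identity should influence the embedding, it would have to be supplied some other way, for example as a node attribute.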

benedekrozemberczki commented 3 years ago

What is the exact ML task that you want to solve?

kajocina commented 3 years ago

I have a matrix M of shape NxD, where N is the number of samples (9000) and D is the number of features (genes, in this case). On top of this I am using a graph G of interactions between the genes as a sort of 'prior knowledge'.

Essentially, I would like to represent each sample as a graph of those interactions and embed the samples into, say, a 128-dimensional space. The one trick I am using here is that I look at the values in the initial matrix M, go to the graph G, and remove the edges between nodes (genes) that are 'behaving differently' in the matrix M. This way I end up with a collection of 9000 very slightly altered graphs. Similar samples should have similar edges removed.
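The edge-removal step described above could look roughly like the sketch below. The thread does not define 'behaving differently', so the criterion used here (an absolute difference between the two genes' values above a threshold) is only an assumption for illustration, as are the function name `sample_graph`, the `threshold` parameter, and the choice to drop edges whose genes are missing from the sample.

```python
def sample_graph(prior_edges, values, threshold):
    """Keep only the prior edges whose endpoints behave similarly in one sample.

    prior_edges: iterable of (gene_u, gene_v) pairs from the prior graph G
    values: dict mapping gene name -> measurement for this sample (one row of M)
    threshold: assumed cut-off on the absolute difference between the two values
    """
    kept = []
    for u, v in prior_edges:
        if u not in values or v not in values:
            continue  # assumption: an edge is dropped if either gene is missing
        if abs(values[u] - values[v]) <= threshold:
            kept.append((u, v))
    return kept


# Hypothetical prior graph and one sample's measurements.
prior = [("A", "B"), ("B", "C"), ("C", "D")]
row = {"A": 1.0, "B": 1.2, "C": 5.0, "D": 5.1}

edges = sample_graph(prior, row, threshold=0.5)
```

Here the edge between B and C is removed because the two values differ by more than the threshold, while the other two edges survive. Applying the function to every row of M would yield one slightly altered graph per sample.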

Now the embeddings will serve as an input to downstream ML such as unsupervised clustering and supervised classification using, say, Random Forests. Many of the samples have labels but many don't (welcome to biomedicine!) and therefore I wanted to run graph embedding in an unsupervised way.

Finally, if the classifier based on the learned embeddings works well, I would like to use the model for future samples as they are acquired.

Does this make sense?

benedekrozemberczki commented 3 years ago

Yes, it makes sense. Your problem could have 2 types of solutions:

  1. A graph convolutional neural network with a node-level loss function (semi-supervised learning).
  2. An inductive attributed node embedding algorithm with a supervised downstream model.

Generally speaking, graph2vec solves a very different problem which is not a node-level ML problem. Is this an industry problem?

kajocina commented 3 years ago

Thanks for your tips! Do you know whether any of those algorithms are implemented somewhere in Python or R? I am asking because I just want an initial proof of principle that this is a good way to go (as opposed to spending a few months implementing the algorithm only to learn that it's not going to work well :-) ).

And generally, this is a common industry problem, i.e. how to represent biological samples in a more meaningful way. I work for a company myself, but at the moment this is a purely academic exploration that, if it works, could be used in a scientific paper.

benedekrozemberczki commented 3 years ago

For supervised models, use PyTorch Geometric. For node embedding, my model FEATHER can be generalized.

Bests,

Benedek
