MKLab-ITI / JGNN

A Fast Graph Neural Network Library written in Native Java
Apache License 2.0
17 stars 3 forks

Parallel executions #9

Open tinca opened 2 months ago

tinca commented 2 months ago

Hello,

While experimenting with differently sized graphs I observed that, when the thread pool is used, it only works for graphs of the same size. If this is documented somewhere that's fine, although the ThreadPool documentation does not mention it; otherwise it would be nice to have it mentioned somewhere. Still, it would be nice to be able to use parallelization. What comes to mind first is to submit same-sized graphs in runs, wait until they are consumed, and repeat. Or is there a better way?
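The batching workaround described here could be sketched roughly like this, using plain `java.util.concurrent` rather than JGNN's own ThreadPool (illustrative only; the task bodies are placeholders):

```java
import java.util.*;
import java.util.concurrent.*;

public class BatchBySize {
    // Groups graphs (represented here just by adjacency matrices) by node
    // count, submits each same-sized group to the pool, and waits for the
    // group to finish before moving on. Returns the number of batches run.
    public static int run(List<int[][]> graphs, ExecutorService pool) {
        Map<Integer, List<int[][]>> bySize = new TreeMap<>();
        for (int[][] g : graphs)
            bySize.computeIfAbsent(g.length, k -> new ArrayList<>()).add(g);
        try {
            for (List<int[][]> batch : bySize.values()) {
                List<Callable<Void>> tasks = new ArrayList<>();
                for (int[][] g : batch)
                    tasks.add(() -> { /* train or predict on g here */ return null; });
                pool.invokeAll(tasks);  // blocks until this same-sized batch completes
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return bySize.size();
    }
}
```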

Once I realized the above, I was finally able to use my dataset, although the model taken from SortPooling still cannot predict. Relatedly, how can edge features be used? I found that using values representing a feature in the adjacency matrix does not work; presumably it is not designed for that.

maniospas commented 2 months ago

Hi and thanks for the report!

This is actually a bug; parallelization is meant to be performed independently for each batch, so it should not leak expected dimensions into other batches. I replicated the issue and will investigate it.

It may take a couple of weeks before this is addressed, because I am also looking to transition to SIMD parallelization, which takes higher priority as a more generic speedup (it applies to graph classification too).

tinca commented 2 months ago

Hi, thank you for looking into it. This is not urgent for me; I am still very much in the experimenting and learning phase.

As to my question: can you point me to some docs or examples showing how I can include edge features?

maniospas commented 2 months ago

While looking into it, I found some hot spots for optimization that cut running times by a lot (I think by more than a factor of 4). I uploaded a nightly version that contains them. I am in the midst of improving CI, so JitPack will be failing on nightly releases for a couple of days. But, if you want, you can get the improvements from the following link (download the jar and add it as a dependency): https://github.com/MKLab-ITI/JGNN/releases/tag/v1.3.12-nightly

You will need the improved version to run something like message passing with edge features: you basically need to "unwrap" the graph into U | V | F, where U are the source node embeddings, V the destination node embeddings, and F the corresponding edge embeddings (obtained as some transformation of edge features); | denotes horizontal concatenation. A strategy to create U | V is described in the message passing tutorial, together with how to go back to the node feature space or obtain a graph with modified weights (with neighbor attention).
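Conceptually, the unwrapping could look like the following plain-array sketch (illustrative only; the names and shapes here are my own, not the JGNN API):

```java
public class EdgeUnwrap {
    // nodeFeats: n x d node feature matrix; edges: m x 2 (source, destination)
    // pairs; edgeFeats: m x f edge feature matrix. Returns an m x (2d + f)
    // matrix whose e-th row is the horizontal concatenation [U_e | V_e | F_e].
    public static double[][] unwrap(double[][] nodeFeats, int[][] edges, double[][] edgeFeats) {
        int d = nodeFeats[0].length, f = edgeFeats[0].length;
        double[][] rows = new double[edges.length][2 * d + f];
        for (int e = 0; e < edges.length; e++) {
            System.arraycopy(nodeFeats[edges[e][0]], 0, rows[e], 0, d);  // U: source embedding
            System.arraycopy(nodeFeats[edges[e][1]], 0, rows[e], d, d);  // V: destination embedding
            System.arraycopy(edgeFeats[e], 0, rows[e], 2 * d, f);        // F: edge embedding
        }
        return rows;
    }
}
```

Any transformation of edge features into F (e.g., a dense layer) would be applied before this concatenation.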

Honestly, I just don't have a convenient dataset to set up a corresponding tutorial, so I would appreciate any suggestion on that front. Previously I was also stuck on the running speed, but with the latest improvements things run at a reasonable speed (I dare say even fast if hidden dimensions are small).

Finally, if you don't have node features but only have edge features, I believe obtaining an equivalent line graph can help you run traditional GNNs for your task.
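For reference, a line graph connects two nodes exactly when the corresponding edges of the original graph share an endpoint, so edge features of the original graph become node features of the line graph. A minimal sketch (plain arrays, independent of JGNN):

```java
public class LineGraph {
    // edges: m x 2 undirected edge list of the original graph. Returns the
    // m x m adjacency of the line graph: line-graph nodes i and j are
    // connected iff original edges i and j share an endpoint.
    public static boolean[][] lineGraph(int[][] edges) {
        int m = edges.length;
        boolean[][] adj = new boolean[m][m];
        for (int i = 0; i < m; i++)
            for (int j = i + 1; j < m; j++) {
                boolean share = edges[i][0] == edges[j][0] || edges[i][0] == edges[j][1]
                             || edges[i][1] == edges[j][0] || edges[i][1] == edges[j][1];
                adj[i][j] = adj[j][i] = share;
            }
        return adj;
    }
}
```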

I hope these help. I will get back to this thread once I have made more progress.

P.S. Convergence is a bit slow with graph pooling, and you will see some epochs with improvements that quickly become worse, but it eventually converges. Here is an example run when I serialize the sort pooling example:

iter = 0  0.507
iter = 1  0.507
iter = 2  0.5075
iter = 3  0.51225
iter = 4  0.756
iter = 5  0.5
iter = 6  0.5
iter = 7  0.5
iter = 8  0.7775
iter = 9  0.6215
iter = 10  0.5665
iter = 11  0.5575
iter = 12  0.566
iter = 13  0.60625
iter = 14  0.66925
iter = 15  0.78525
iter = 16  0.7505
iter = 17  0.73425
iter = 18  0.79775
iter = 19  0.78325
iter = 20  0.71225
iter = 21  0.66575
iter = 22  0.65825
iter = 23  0.66075
iter = 24  0.6745
iter = 25  0.7315
iter = 26  0.7755
iter = 27  0.787
iter = 28  0.78875
iter = 29  0.78575
iter = 30  0.78
tinca commented 2 months ago

Hi maniospas,

First of all, thank you again for promptly addressing my issues, and I am glad to hear about the coming improvements. For a few days I'll be dealing with other stuff, so no progress can be expected on my part.

Thanks for the message passing reference. Frankly, I am a complete newbie to this field and feel a bit lost seeing so many terms that are new to me; there's so much to digest :-). I'll think about possible datasets; the ones I work with need to be checked for possible public usage. Node features are relevant to my cases, just as edge features are. It would be illusory to expect quick success on my part; I am prepared to proceed in a somewhat trial-and-error way while picking up the necessary basics. I only casually mentioned the lack of convergence; considering how relevant edge features are for my case, it is no wonder. (It started around 0.05 and couldn't improve much after 400+ epochs.)

tinca commented 2 months ago

I came across a dataset (structure vs. odor), free for non-commercial usage, although downloading it requires a request to the data owners. I found it in an outstandingly clear and useful writeup: A Gentle Introduction to Graph Neural Networks.

tinca commented 2 months ago

Found an easily and freely accessible lipophilicity dataset. I can help with constructing the skeleton for this tutorial; however, the NN architecture part would be too much for me for now.

maniospas commented 2 months ago

Hi tinca. First of all, I am also thankful for all the feedback. 😄

For a dataset that needs a request to data owners, I imagine that we will have issues creating a public version that everyone can download for out-of-the-box testing. However, the second dataset you mention seems really useful! (Bonus points that it's small.)

Since I come from a completely unrelated domain, can you explain a bit how entries like Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14 should be converted to a graph?

I would be interested in receiving a tutorial skeleton from you (especially a short introduction to the problem that non-domain-experts can understand); I can probably fill in the learning part without issue. I will know for sure where this should be placed in the repository from next week on, so we can start from there. (As a rough idea, I am thinking of creating a couple of practical application tutorials and having a central guidebook for a formal presentation of JGNN's capabilities.)

tinca commented 2 months ago

Hi maniospas,

More feedback to come later as I progress...

Hope that dataset will prove to be usable; if not, there are more possibilities I can look for. The entry you quoted is a SMILES string, a compact textual description of a chemical structure (for a quick visualization, see this app). I use certain libraries to import the SMILES and create an in-memory domain model representation out of it; from this model the adjacency matrix can easily be constructed. It was at this point that I thought I could contribute the skeleton for a tutorial in this domain. Your plan for the practical application tutorials sounds good, and I am glad that this example can join that initiative. I am going to start working on this example until you find the best place for it within the project.

maniospas commented 2 months ago

Hi again. This is a very nice prospect. I am mostly done with all the refactoring and ready to tackle automated graph classification. I will update here once I have a directory with some first domain-specific tutorials (you can then fork the project, add yours up to the point where graphs are created, and create a pull request).

If I understood correctly, for edge features you are mostly limited to a couple of edge labels that correspond to bond types, right?

These are actually easier to parse by storing each bond type in a different adjacency matrix, given that there are only a few bond types. Then it's just a matter of aggregating all bond types (e.g., with DistMult). I don't know if this is already standard in molecular analysis, but it is certainly common in knowledge graphs.
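A plain-array sketch of this idea, with one adjacency matrix per bond type and a DistMult-style aggregation that weights each type's messages element-wise by a per-type relation vector (illustrative code, not the JGNN API):

```java
public class BondTypeSplit {
    // bonds: m x 3 rows of (u, v, type). Returns one n x n adjacency per type.
    public static double[][][] splitByType(int n, int[][] bonds, int numTypes) {
        double[][][] adj = new double[numTypes][n][n];
        for (int[] b : bonds) {
            adj[b[2]][b[0]][b[1]] = 1;
            adj[b[2]][b[1]][b[0]] = 1;  // bonds are undirected
        }
        return adj;
    }

    // DistMult-style aggregation: for each type t, propagate A_t @ (feats * r_t),
    // where * is the element-wise product with the type's relation vector r_t.
    public static double[][] aggregate(double[][][] adj, double[][] feats, double[][] rel) {
        int n = feats.length, d = feats[0].length;
        double[][] out = new double[n][d];
        for (int t = 0; t < adj.length; t++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (adj[t][i][j] != 0)
                        for (int k = 0; k < d; k++)
                            out[i][k] += adj[t][i][j] * feats[j][k] * rel[t][k];
        return out;
    }
}
```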

Actually, this will be very convenient as an intermediate step before "for" statements are added in Neuralang. This is how I plan to implement them for a general case, but this might take a while:

fn transE(h, t, r) {
    return L1(h-t+r, dim: 'row');
}

fn transEAffine(h, t, r, hidden: 16) {
    h = matrix(hidden)*h + vector(hidden);
    t = matrix(hidden)*t + vector(hidden);
    return transE(h, t, r);
}

fn layer(A, feats, r) {
    u = from(A);
    v = to(A);
    h = feats[u];
    t = feats[v];
    return transEAffine(h, t, r);
}

fn layer([A], feats) { // [A] indicates a list of adjacency matrices here
    x = nothing();
    for A {
        r = vector(hidden);
        x = x + layer(A, feats, r);
    }
    return x/len(A);
}

fn architecture([A], feats) {
    feats = layer([A], feats);
    feats = relu(feats);
    feats = layer([A], feats);
    return feats;  // add some operations for graph pooling here
}
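For reference, the transE function above reduces each row of h - t + r with an L1 norm. A plain Java equivalent of that scoring step (illustrative only, working on raw arrays rather than JGNN tensors):

```java
public class TransE {
    // h, t, r: one row per edge (head, tail, and relation embeddings).
    // Returns the per-row transE score ||h - t + r||_1.
    public static double[] score(double[][] h, double[][] t, double[][] r) {
        double[] out = new double[h.length];
        for (int i = 0; i < h.length; i++)
            for (int k = 0; k < h[i].length; k++)
                out[i] += Math.abs(h[i][k] - t[i][k] + r[i][k]);
        return out;
    }
}
```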
maniospas commented 2 months ago

Hi again @tinca I started a tutorial folder here: https://github.com/MKLab-ITI/JGNN/tree/main/tutorials You can start adding your material in that folder. Don't forget to add a tentative summary in the folder's readme and credit yourself in the tutorial. Please open a different issue to discuss this contribution.

In other news, I added automation for graph classification and fixed parallel execution. I've been painfully slow at creating good docs, but I'm mostly there. For the time being, I suggest using the latest nightly version with this example as a reference: https://github.com/MKLab-ITI/JGNN/blob/main/JGNN/src/examples/graphClassification/SortPooling.java

P.S. I don't know if there are too many nightly build notifications, in which case it may be a good idea for me to find a different nightly build strategy.

tinca commented 2 months ago

Hi maniospas,

If I understood correctly, for edge features you are mostly limited to a couple of edge labels that correspond to bond types, right?

Yes, exactly.

These are actually easier to parse by storing each bond in a different adjacency matrix, given that there only a few bond types.

I don't get this. For example, formic aldehyde's structure is H-(H-)C=O. Does it mean you have two matrices containing all atoms, one for single and one for double bonds?

tinca commented 2 months ago

Thanks for the tutorial placeholder. I got a little busy lately but will catch up soon. Also, I have seen that nightly jars can be obtained from JitPack together with sources, which makes local development easier. I am yet to get accustomed to the JitPack UI, though :-). A note: I observed that the name part (JGNN) of the nightly artifacts changed to lower case. If you mean Javadocs by "creating good docs", these could also be offered as a downloadable artifact, although for development the sources are good enough; they come together then :-). I do not get notifications about nightly builds, so that is not a problem.

maniospas commented 1 month ago

Yes, you'd have one matrix for the single bonds and one for the double ones. It is no issue that the double bonds could be exceptionally sparse. If we call the bond matrices A and B, we could then write layers like relu(A@h{l}@matrix(hidden, hidden) + B@h{l}@matrix(hidden, hidden) + vector(hidden)) to make each node leverage information from its neighbors while accounting for the type of bond in a different way.
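To make the formaldehyde (H2C=O) case from the question above concrete, with atoms indexed 0=C, 1=O, 2=H, 3=H (plain arrays purely for illustration):

```java
public class Formaldehyde {
    // A: single bonds only (the two C-H bonds).
    public static double[][] singleBonds() {
        double[][] a = new double[4][4];
        a[0][2] = a[2][0] = 1;  // C-H
        a[0][3] = a[3][0] = 1;  // C-H
        return a;
    }

    // B: double bonds only (the C=O bond).
    public static double[][] doubleBonds() {
        double[][] b = new double[4][4];
        b[0][1] = b[1][0] = 1;  // C=O
        return b;
    }
}
```

Both matrices cover all four atoms; each bond appears in exactly one of them depending on its type.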

Once you have the graph loading in place in your tutorial I can fill the architecture.

Thanks for letting me know about the nightly artifacts. There's an online page for the Javadoc here: https://mklab-iti.github.io/JGNN/javadoc/ By good docs, I'm referring to the whole package of having an introductory guide, Javadoc for (eventually) all components, and practical examples. (The previous tutorial was too abstract for my liking because it presented code that was hard to run immediately and went too deep into technical details.)

Glad to know that source documentation makes development easier. I don't completely follow what you are doing, though: nightly jars are released via GitHub itself. Do they not have Javadoc attached, e.g., to show when you mouse over them in Eclipse? Admittedly, I only tested that this works when adding the JitPack dependency in the pom as described here: https://jitpack.io/#MKLab-ITI/JGNN/v1.3.33-nightly

tinca commented 1 month ago

The way bond types are represented as features is clear now, as is the plan for good docs. About source artifacts: I am using Gradle and IDEA. To get the sources artifact, it either needs to be defined as an explicit dependency in Gradle (or implicitly, using the new IDEA plugin for Gradle appropriately configured), or IDEA offers to download it on first encounter. Now that these artifacts are published on JitPack, any of the above works for me.

I referred to the JitPack UI because there is a lookup function which lists available versions, and at first sight it was not obvious to me what the "Get it" button meant. But it's OK now.

Once you have the graph loading in place in your tutorial I can fill the architecture.

I'll try in the coming days in the little time I have. If I cannot get to that stage, I'll be away next week, so I can only work on it after that. Until then, a question: are you OK with using third-party packages for the sake of a tutorial? Creating graphs requires importing from some chemical file formats, or I may need to write my own simple code for that.

maniospas commented 1 month ago

Nice. :-)

For the sake of a tutorial it's fine to use whatever library, as long as its setup is included in the "Setup" section. Actually, I believe this is better because it gives a fully fleshed-out practical example.

That said, if you have the time to write a simple importer (even if independently), I could also use it as a basis to create an out-of-the-box dataset that can be imported for fast experimentation from the main library.

tinca commented 1 week ago

Hi maniospas,

FYI: I sent a message to your @hotmail address a few days ago.