daniel4x commented 3 months ago

ULTRA-LM

Hi @migalkin, per my issue here, I finally had some time to refactor part of my code and pack it into a PR.

It is worth mentioning that I've also contributed an additional dataset combining KG and textual embeddings. The embeddings are currently provided as a separate download link.

Changes:

- README.md instructions
- pretrain.yaml - configuration with the path to LM vectors
- Custom pretrain script - I slightly changed the original pretrain to load the embeddings from a given path
- A new entity model - based on the NbfEntity model, here we introduce the new combining layer
- New dataset RedHatCVE - a cybersecurity vulnerability dataset. To allow correct mapping between the entity labels and the embeddings, I explicitly added the mapping to the Data object (a bit ugly, I know 😢).
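
The combining layer itself is not shown in this thread. As a rough illustration only, here is one way such a layer could work, assuming the frozen LM vectors are concatenated with the structural entity representation and projected back to the hidden size. `CombineLayer` and all dimensions are hypothetical and are not taken from the PR:

```python
import torch
from torch import nn


class CombineLayer(nn.Module):
    """Hypothetical combining layer: concatenate, then project (an assumption, not the PR's code)."""

    def __init__(self, hidden_dim: int, lm_dim: int):
        super().__init__()
        # Maps [structural ; LM] features back to the GNN hidden size
        self.proj = nn.Linear(hidden_dim + lm_dim, hidden_dim)

    def forward(self, struct_repr: torch.Tensor, lm_repr: torch.Tensor) -> torch.Tensor:
        # Both inputs have shape (num_entities, feature_dim)
        return self.proj(torch.cat([struct_repr, lm_repr], dim=-1))


# Toy usage: 5 entities, 64-dim structural features, 768-dim LM vectors (made-up sizes)
layer = CombineLayer(hidden_dim=64, lm_dim=768)
out = layer(torch.randn(5, 64), torch.randn(5, 768))
```

Other choices (summation after projection, gating) would fit the same interface; concatenation is used here only because it is the simplest to read.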

daniel4x commented 3 months ago

Hi @migalkin, any chance you've had the opportunity to look into this PR?

migalkin commented 3 months ago

Hi Daniel, thanks for the PR, it indeed touches upon one of the most requested features!

The new features are pretty nice, but I have a hard time reproducing them because there is no link to the pickle file with the precomputed LLM features. Generally, in order to be merged with the main branch, the PR will need a bit more work to encompass more datasets and possible use cases:

Currently, the PR changes some of the inner workings of the datasets, mostly in favor of one new dataset, RedHatCVE. We'll need a unified mechanism for obtaining LLM features for all possible datasets if they have relevant entity / relation descriptions.
Putting node features inside the GNN like here
```
self.lm_vectors = nn.Embedding.from_pretrained(kwargs['lm_vectors'], freeze=True)
```
is rather inefficient: it will be tedious to pretrain a model on several datasets because you'll have to swap that register all the time for each random dataset. PyG has a built-in mechanism for adding node features to the Data object (e.g., as Data(x=llm_feature, edge_index=...)), and I believe it would be a much better interface to read the features from the relevant data object together with the edge index and edge type.
In many datasets, relations might also have their descriptions which could be encoded by LLMs as well. So we'll need a mechanism to optionally process those if they exist.

Let me know what you think. We also have an ongoing effort in a similar direction; it might be ready in a month or so.

daniel4x commented 3 months ago

@migalkin thanks.

Indeed, I agree. In my case, this branch was tailor-made for my research, as I was using a single graph with LM. I am glad to hear about the ongoing effort and am looking forward to learning more about it (and even contributing when/if possible 😉).

I believe you can close the PR, or you can leave it open for future reference.

DeepGraphLearning / ULTRA

ULTRA-LM: Language model integration for ULTRA #24
