Closed daniel4x closed 2 months ago
Hi @migalkin Any chance you've had the opportunity to look into this PR?
Hi Daniel, thanks for the PR, it indeed touches upon one of the most requested features!
The new features are pretty nice, but I have a hard time reproducing them because there is no link to the pickle file with the precomputed LLM features. Generally, in order to be merged into the main branch, the PR will need a bit more work to cover more datasets and possible use cases:
- `RedHatCVE`: we'll need a unified mechanism for obtaining LLM features for all possible datasets if they have relevant entity / relation descriptions.
- `self.lm_vectors = nn.Embedding.from_pretrained(kwargs['lm_vectors'], freeze=True)` is rather inefficient: it will be tedious to pretrain a model on several datasets because you'll have to swap that register all the time for each random dataset. PyG has a built-in mechanism for adding node features to the Data object (e.g., as `Data(x=llm_feature, edge_index=...)`), and I believe it would be a much better interface to read the features from the relevant data object together with the edge index and edge type.
Let me know what you think. We also have an ongoing effort in a similar direction, it might be ready in a month or so.
@migalkin thanks.
Indeed, I agree. In my case, this branch was tailor-made for my research, as I was using a single graph with LM. I am glad to hear about the ongoing effort and am looking forward to learning more about it (and even contributing when/if possible 😉).
I believe you can close the PR, or you can leave it open for future reference.
ULTRA-LM
Hi @migalkin, per my issue here, I finally had some time to refactor part of my code and pack it into a PR.
It is worth mentioning that I've also contributed an additional dataset combining KG and textual embeddings. The embeddings are currently provided as a separate download link.
Changes:
- `README.md` instructions
- `pretrain.yaml` - configuration with the path to LM vectors
- `pretrain` script - I slightly changed the original pretrain to load the embeddings from a given path
- `RedHatCVE` - a cybersecurity vulnerability dataset. To allow correct mapping between the entity labels and the embeddings, I explicitly added the mapping to the Data object (a bit ugly, I know 😢 ).
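As a rough illustration of the label-to-embedding mapping described in the last item, here is a sketch that loads precomputed vectors from a path and orders them by node index. The pickle layout (a dict from entity label to a list of floats), the helper name `load_aligned_vectors`, and the toy labels are all assumptions made for this example; the PR's actual file format is not shown in this thread.

```python
import pickle
import tempfile
from pathlib import Path


def load_aligned_vectors(path, entity_vocab):
    """Return LM vectors ordered so that row i belongs to entity_vocab[i].

    Assumes the file holds a dict {entity label: list of floats}.
    """
    with open(path, "rb") as f:
        label_to_vec = pickle.load(f)
    missing = [label for label in entity_vocab if label not in label_to_vec]
    if missing:
        # Fail early instead of silently misaligning rows and node ids.
        raise KeyError(f"no LM vector for entities: {missing}")
    return [label_to_vec[label] for label in entity_vocab]


# Toy data: hypothetical labels, stored in a different order than the vocabulary.
toy_vectors = {"package-b": [0.0, 1.0], "cve-a": [1.0, 0.0]}
entity_vocab = ["cve-a", "package-b"]  # position in this list == node id

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lm_vectors.pkl"
    with open(path, "wb") as f:
        pickle.dump(toy_vectors, f)
    aligned = load_aligned_vectors(path, entity_vocab)
```

Doing the alignment once at load time would let the resulting rows be stored directly as node features, so the label mapping itself would not need to live on the Data object.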