arthurcgusmao / SFE

Implementation of Subgraph Feature Extraction, a variant of the Path Ranking Algorithm for KB Completion.
8 stars 4 forks source link

Link Prediction Part #3

Closed ezio0218 closed 3 years ago

ezio0218 commented 3 years ago

After I use SFE to extract features, I wanna do the link prediction between any two given entities in my kb. For this part I just feel puzzled. All the features are of different length and the form of the features are so complex with a lot of words, how can I use them

arthurcgusmao commented 3 years ago

Hi @ezio0218 , please see my comment on https://github.com/arthurcgusmao/SFE/issues/1#issuecomment-423589988

DISCLAIMER

This repo has not been fully tested, it is an incomplete implementation and should NOT be relied upon.

If you would like to contribute to this repo, PRs are super welcome.

ANSWER

After extracting the features you should build a binary feature matrix for each relation of the knowledge graph where each line of the matrix regards a tuple (object, subject) and each column corresponds to the existence of a path on the graph present between the object and the subject. Then you can use this binary feature matrix to train any kind of supervised learning classification model, where the feature matrix are the feature variables and the existence of the original relation of your interest is the target variable. See page 42 of my master thesis: https://teses.usp.br/teses/disponiveis/3/3141/tde-04022019-094854/publico/ArthurColombiniGusmaoCorr18.pdf

For further reference, please see the original papers on SFE and PRA:

ezio0218 commented 3 years ago

Ohhh thank you so much!!! @arthurcgusmao So my next steps will be

  1. build feature matrix
  2. my lines will be my entity pairs
  3. my columns will be the existence of each ordered path combination or sequence Thus if I have a lot of relations in my kg, the complexity and feature dimentions will explode. Am I right? So will it be a better option if I do the link prediction using some knowledge graph embedding method and then go back to use SFE classification to do the interpreting job (predicting possible paths) between all the 'suspious entity pairs' I found?
arthurcgusmao commented 3 years ago

Yes that's right, but since SFE usually performs a breadth-first search limited by the number of relations it will expand, the number of features (or columns) remains limited.

If you are planning on doing link prediction, then it's unlikely you'll find "suspicious entity pairs" because link prediction usually involves considering only the entities that are best ranked. For this task, embedding models perform quite well (but so does SFE). It may even be that using SFE straight on the dataset will perform better than embedding models; in such a case it is usually the best option given that you'll have the best performance with an interpretable model.

If you want to explain the predictions of an embedding model, you might as well use the package XKE: https://github.com/arthurcgusmao/XKE

There, you can also find Matt Gardner's original PRA code as a dependency for extracting features using SFE, which is more reliable than this repository (which I did not finish testing -- it was a tangent idea).

ezio0218 commented 3 years ago

Thanks really a lot! @arthurcgusmao And one last question, if my edges and my entities have attributes, which means I need to treat them seperately according to their atrributes. Can I still use SFE method? I think it will just fail to work because the number of possible ordered path sequence will be too much to use. Is there a possible way to deal with such situation? Or no matter how large the martix will be, as long as the dimensions are limited, SFE can be used?

arthurcgusmao commented 3 years ago

I would say the bottleneck in having too many possible paths among nodes (entities) in your KG is mostly the amount of time that SFE will take to run, rather than the supervised learning model you will use on the feature matrix. So if you manage to run SFE, in general, you should be fine from that point onwards.

Can you please specify what you mean by both edges and entities having attributes? What do you exactly mean by attributes in this case and what form can these attributes take for each of them (edge and entity)?

ezio0218 commented 3 years ago

Yes that's right, I have implemented SFE and right now the speed is quite unacceptable.

Well it's a dataset consists of a lot of APPs, their components and the call relations among them. The relations between apps and components is that one app can have many components. And the relations between apps are the first app can call another app by the later app's specific component.

In that case, the relations will be 'app_1 -> callby(cpn_name) -> app_2', 'cpn_name -> belongTo -> app_name'. Because the cpns can be quite a lot, which means the possible paths can be quite a lot. At last, some of the apps are malware and the others remain unknown and I want to predict the most dangerious or suspicious call paths combinations and make them as rules to interpret reasons of the apps being malware. That gives the third relation: 'app_name -> is -> 'Malware''. I also want to find some other possible malware apps somehow, based on rules I found or embedding.

So the edges have attribute 'cpn_name'. And I plan to use Neo4j to build my kg in the future.

arthurcgusmao commented 3 years ago

That's a very interesting application, thanks for sharing @ezio0218 . I guess I can't contribute much to the details of the domain, but what I can say is that pretty much all KG embedding models and PRA and SFE have been built to work with KGs that have many different relations (such as NELL, WORDNET, etc.)

One thing that I would do though, is to try to reevaluate the domain modeling. By this, I mean trying to come up with a different way of structuring your entities and relations (edges). Particularly, doing so in a way that helps the assumptions make by the algorithm of your choice (be it SFE, PRA, embeddings, etc.).

For instance, does it really make sense to have one different edge call_by_(cpn_name) for each component you have? Would it be easier to instead link the component itself directly to the app it's calling? For instance:

# Notation implies the HEAD, RELATION, TAIL order

# Instead of
cpnA, belongTo, app1
cpnB, belongTo, app1
cpnC, belongTo, app2
cpnD, belongTo, app2
app2, callBy_cpnA, app1
app2, callBy_cpnB, app1
app1, callBy_cpnC, app2
app1, callBy_cpnD, app2

# You could try
cpnA, belongTo, app1
cpnB, belongTo, app1
cpnC, belongTo, app2
cpnD, belongTo, app2
app2, callBy, cpnA
app2, callBy, cpnB
app1, callBy, cpnC
app1, callBy, cpnD

I.e., the idea is to try to frame the problem in a way that helps the algorithm to learn.