generate embeddings from multiple rdf/ttl

ilseva commented 2 years ago

Hi, thanks for sharing your works.

We would like to use jRDF2Vec to generate embeddings to have a base of knowledge for a semantic service engine. Our starting point is a custom ontology where some of the object properties refer to public vocabularies (in rdf format) like Frequency Vocabulary and some others to our custom vocabularies.

The approch we follow is:

store ontology in TTL file
download all public vocabularies in RDF format and store them in files
create our custom vocabularies and store them in TTL (is more simple to write) files
create "individuals" based on our ontology in TTL and store them in files
generate walks for each of the files above with java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph <ttl_file|rdf_file> -onlyWalks -walkDirectory <custom_folder>
move the walk_file_0.txt.gz in a specific folder to avoid overwriting
merge walks files with java -jar jrdf2vec-1.2-SNAPSHOT.jar -mergeWalks -walkDirectory <specific_folder> -o <merged_walks>
generate embeddings with java -jar jrdf2vec-1.2-SNAPSHOT.jar -onlyTraining -light entities.txt -minCount 5 -walkDirectory <specific_folder>

Is this process the correct one? If not could you point out how to change it?

Furthermore, using the example of Jupyter Notebooks in your baseline we try to found most similar "concepts" in our model but we found the following unclear issue: if we build the query using keys that belongs to public vocabularies and to our individuals, the results we obtain refer only similar concepts of public vocabularies (expected concepts similar to our individuals seem not to be considered).

from gensim.models import KeyedVectors

kv_file = "../merged_walks/model.kv"
vectors = KeyedVectors.load(kv_file, mmap='r')

def closest(word_vectors: KeyedVectors, concepts: [str], negatives: [str]=None) -> None:
    print(f"Closest concept to: {concepts}")
    for other_concept, confidence in word_vectors.most_similar(positive=concepts, negative=negatives, topn=50):
        print(f"{other_concept} ({confidence})")

closest(word_vectors=vectors, concepts=[
      "https://our-custom-namespace/subjects/CustomSubject"
    , "http://publications.europa.eu/resource/authority/country/AUT"
    , "http://inspire.ec.europa.eu/theme/lc"])

Closest concept to: ['http://inspire.ec.europa.eu/theme/lc', 'http://publications.europa.eu/resource/authority/country/AUT', 'https://our-custom-namespace/subjects/CustomSubject']
http://publications.europa.eu/resource/authority/country/ESP (0.9480347037315369)
http://publications.europa.eu/resource/authority/country/CYP (0.9446530342102051)
http://publications.europa.eu/resource/authority/country/REU (0.9442076086997986)
http://publications.europa.eu/resource/authority/country/SWE (0.9437082409858704)
http://publications.europa.eu/resource/authority/country/ROU (0.9435302019119263)
http://publications.europa.eu/resource/authority/country/FIN (0.9430405497550964)
http://publications.europa.eu/resource/authority/country/BLR (0.9423658847808838)
http://publications.europa.eu/resource/authority/country/PNG (0.9423239231109619)
http://publications.europa.eu/resource/authority/country/BEL (0.9419782757759094)
http://publications.europa.eu/resource/authority/country/EST (0.9419098496437073)
http://publications.europa.eu/resource/authority/country/HRV (0.9417317509651184)
http://publications.europa.eu/resource/authority/country/MLT (0.9416428804397583)

Thanks for you support. Sevastian

janothan commented 2 years ago

Hi @ilseva,

thank you for reaching out. Currently, jRDF2vec works best if you use NT files. If you have multiple NT files, you can all place them in one directory and use the -graph <your directory> parameter. However, this only works if all files are in NT format. I know this is not optimal currently. I will try to support more formats for directory parsing, but I cannot commit to a specific timeline.

If you cannot transform your files into NT format, the process you described is correct.

I cannot comment in detail on your query issue. I recommend considering:

Do you have one embedding space (i.e., one kv file) which contains both vocabularies? -> Check the vocabulary.
If the results are non-intuitive to you, think how the input graph looks like. If there are no connections between public vocabularies and your dataset, there will be no similarities.
Given that separate vocabularies are quite interlinked among each other, you may want to refine the query such that you filter the result list for the most similar concepts of your dataset (throw out most similar concepts in other vocabularies) in case this is what you are truly interested in.

ilseva commented 2 years ago

Thanks @janothan for your considerations. We are able to convert all ttl and rdf files to nt format and the process is more streamlined.

We think that the query results are influenced by the number of our datasets which is lower than the entries of public vocabularies. We are still working on a PoC that aim to confirm if our approch for build semantic search engine in right.

Another question, if you can help us: do you think that is useful adding to the query concepts (aka filter) that belongs to the ontology? i.e. class names, object property names, ...

Thanks! Sevastian

WarisBunglawala commented 2 years ago

hello,

First of all thank you for such a nice solution. I have been working on the same kind of project as @ilseva but i have a few questions regarding it, hope you can help me solve the problem.

so first of all i have used one hudge dataset of size around 60gb and performed some cleaning and preprocessing.

after that i have created custom ontology for my knowledge graph, and created knowledge graph using python rdflib. I have access to super computer which have 96gb of ram so due to that i was able to create knowledge graph in 4 parts which means i have 4 ttl files based on the same custom ontology that i created. and all 4 ttl files can be said that combining all can generate one big KG.

now i wanted to created embeddings so i used your mode jrdf2vec, but as i mentioned file sizes are big even in ttl format they all combined to nearly 15 gb of size (1.4g +4.2g+ 4.2g +4.8g = 14.6 GB)

as mentioned by @janothan i was able to create nt files too and by default jrdf2vec also create nt files form ttl, but due to java heap memory exceptions i can not walk all of the files in one go as i can only provid so much of heap space which wasnt enough for all of the files combine

so i have generated walks for each of the 4 files seperate and move all gz to custom folders to save from overwritings. now i have 4 wlaks folder for each ttl file and then i used -mergeWalks to generate meregedWalks.txt for each 4 walks folder.

but the problem is that txt files are too big that one of the meregedWalks.txt reached 53.5 GB of size.

now my questions are: 1) Will i be able to do training for that much big files using 96gb of super computer ram ? 2) If i do the training for each txt file (4 files) seperatly, how can i merge all trained models as they are none but the parts from one big knowledge graph. 3) Basically the idea behind my project is I want to generate embeddings for semantic similarity and Recommendation Generation based on custom knowledge graph that is in 4 parts (4 ttl files) so if anyone can suggest better way to perform this task and help me out

Thank you for this wonderfull solution and detailed explanation.

janothan commented 2 years ago

Another question, if you can help us: do you think that is useful adding to the query concepts (aka filter) that belongs to the ontology? i.e. class names, object property names, ... ~ @ilseva

I am not sure whether I get the question. If you find class nodes not helpful, you can filter them. If you are thinking about datatype properties ("names" provided via rdfs:label etc.) - those are ignored by the approach and should not appear in the embedding space.

Will i be able to do training for that much big files using 96gb of super computer ram ? ~ @WarisBunglawala

Very likely this will work. The RAM requirements are significantly higher for the walk generation to speed-up the process. The actual training step is less memory consuming.

If i do the training for each txt file (4 files) seperatly, how can i merge all trained models as they are none but the parts from one big knowledge graph. ~ @WarisBunglawala

I do not advise doing this. These are separate embedding spaces. You could concatenate vectors etc. but I think this is quite a dirty approach.

One last remark:

so i have generated walks for each of the 4 files seperate ~ @WarisBunglawala

Please not that this leads to a different outcome than generating walks for the merged graph! I am not saying that it will not work but the walks will only be generated for each of the 4 files separately (rather than the complete graph). If you have an insufficient amount of memory, you could consider loading the 4 files in one HDT or one TDB store. jRDF2vec can also handle graphs stored on disk. The memory will for sure be sufficient but the walk generation will take significantly longer.

WarisBunglawala commented 2 years ago

@janothan Thanks for the guidence, as you mentioned this is not a good approach i will consider going with the TDB store. so here i am mentioning the details.

so unfortunatly the super computer that i have access to work via sftp protocols and i have limited acces to it.

on my personal laptop i have used jena fuseki but the thing is my laptop only have 8gb of ram so i cant imagine how much time it will take to generate walk for the given large size of data with 8gb of ram. And i can run fuseki server on super computer too but however i have no idea how can i upload the 4 ttl files to it as i can not use web interface from supercomputer. so if you know a easy way via any python program or cmd can you provide me the guidence for that too.

1) Will it be okay to upload ttl files to TDB or do i have to provide nt which one will be better? 2) Also can you provide some more details on how can i access the TDB stored knowledge graph using jrdf2vec model ? 3) While generating walks from TDB is there any chance that jrdf2vec encounter heap memory error ? 4) Generated walks from TDB store will most likly create the very large gz and mereged txt file, around 100GB so will i be able to train the model from it or will i encounter memory errors?

Sorry that i have asked so many questions and i undesrtand that some of them are not even directly related to your project. but i have to complete this project within a very short period of the time and generate recommendation using it and i am new to this kind of area, so i request you to consider all my question and guide me as per your experties and knowledge.

Thank you Again.

janothan commented 2 years ago

so unfortunatly the super computer that i have access to work via sftp protocols and i have limited acces to it. ~ @WarisBunglawala

You can easily build the TDB dataset via SSH and tdbloader on your server.
Alternatively, you can build the TDB dataset locally and just upload the TDB directory via SFTP.
There is no need for fuseki.

Will it be okay to upload ttl files to TDB or do i have to provide nt which one will be better? ~ @WarisBunglawala

There is no difference. TDB builds indices. Whatever input file you use, the result will be identical. (A relational database also does not care if you upload the data via CSV or TSV.)

Also can you provide some more details on how can i access the TDB stored knowledge graph using jrdf2vec model ? ~ @WarisBunglawala

The method call is identical. Just -graph <your TDB directory>

While generating walks from TDB is there any chance that jrdf2vec encounter heap memory error ? ~ @WarisBunglawala

Unlikely.

will i be able to train the model from it or will i encounter memory errors? ~ @WarisBunglawala

I really can't answer this question for you. This depends on various factors such as the number of unique nodes in the graph. Just try it.

On a general level: Do you use the java -Xmx95G ... command option? (Read more about it here.) If not, try this first because the JVM will not use all RAM otherwise.

so i request you to consider all my question and guide me as per your experties and knowledge. ~ @WarisBunglawala

I can try to help you, just be understanding that this is not my main job and that I may take some time to answer your questions.

WarisBunglawala commented 2 years ago

Thank you for your time @janothan it means a lot to me :)

That clears a lot and i will try all of the things you mentioned. And yes i do use -Xmx to provide more ram to jvm and i think it will be best if you mention this in your documentation too for the enhancment of your model work. as not many people will try this in first go but for bigger files it is worth mentioning.

ilseva commented 2 years ago

I am not sure whether I get the question. If you find class nodes not helpful, you can filter them. If you are thinking about datatype properties ("names" provided via rdfs:label etc.) - those are ignored by the approach and should not appear in the embedding space.

Thanks for your time and useful hints @janothan! We increase our datasets and then try to develop a strategy to define more appropriate filters. We plotted embeddings generated by jRDF2Vec and there are a lot of density near public vocabulary concepts. I think that could be the problem to investigate for our use case.

dwslab / jRDF2Vec

generate embeddings from multiple rdf/ttl #91