Closed: mommi84 closed this issue 5 years ago
Can we try to implement Recurrent Neural Network Embedding for KB completion?
Hi @tramplingWillow, sure. Please complete the warm-up tasks if you haven't yet and elaborate your proposal in a Google doc. When you're done, add my username at gmail.com to the doc.
Student proposal for this task: https://github.com/dbpedia/GSoC/issues/14
We can take a first step toward the OOV word-embedding challenge by learning vectors for the character n-grams within a word and then summing them to produce the final word embedding. I have been working on similar ideas in my research project at my university as well. What do you think?
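The summation idea above can be sketched in a few lines; this is a minimal illustration only, and the n-gram range, dimensionality, and helper names are my own assumptions rather than anything from the RVA code:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """All character n-grams of a word, with boundary markers < and >."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def word_vector(word, ngram_vectors, dim=300):
    """Compose a word embedding by summing its learned n-gram vectors.
    N-grams never seen during training simply contribute nothing."""
    vec = np.zeros(dim)
    for ng in char_ngrams(word):
        vec += ngram_vectors.get(ng, 0)
    return vec

# Toy illustration with random stand-ins for "learned" n-gram vectors:
rng = np.random.default_rng(0)
ngram_vectors = {ng: rng.standard_normal(300) for ng in char_ngrams("embedding")}
oov = word_vector("embeddings", ngram_vectors)  # OOV word, shares n-grams with "embedding"
```

Because an OOV word shares most of its n-grams with in-vocabulary words, the composed vector lands near its morphological relatives.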
Error on running RVA-based embedding algorithm
```
File "WikiDetector.py", line 132
    surfForms = loadSurfaceForms("AnchorDictionary.csv", 5)  # selecting the top 5 most common anchor texts
File "WikiDetector.py", line 90, in loadSurfaceForms
    with open(filename, 'r') as output:
FileNotFoundError: [Errno 2] No such file or directory: 'AnchorDictionary.csv'
```
How do I generate this file?
Hi @seqryan, you need to run the script MakeDictionary.py with all the text files from the wiki dump as input. That will generate the file `AnchorDictionary.csv`. Use the output directory of WikiExtractor.py as input, as it will contain all the plain-text files extracted from the XML dump.
Hi @amanmehta-maniac, that sounds good. Please prepare a proposal draft in a Google doc, where you explain how you intend to handle the problem. When you're done, invite my username at gmail.com to the doc.
Thank you @seqryan and @tramplingWillow for your debugging efforts!
I also added an example of successful project proposal.
Considering our goal is to predict information about out-of-vocabulary resources, I have two queries:

1. What information is available about out-of-vocabulary resources (since the embeddings will use this information to infer other properties)?
2. What information do we need to predict about out-of-vocabulary resources?
Is there a need to download the entire dump for the warm-up tasks, or can I get the gist using some smaller substitute dump, considering the original dump archive itself is ~14 GB, if I am not wrong? @tramplingWillow, any leads?
@amanmehta-maniac, no, you need not. As @mommi84 has mentioned, just a subset is needed to perform the analysis. I figure it's important that you simply familiarise yourself with the workflow of the RVA-based embedding system.
Cool. Which substitute dump would be good? Can you link me up with the dump you used? :)
Here, https://dumps.wikimedia.org/enwiki/latest/ This is the index for the latest wiki dumps. I used the dump with the article names and abstracts. https://dumps.wikimedia.org/enwiki/20180220/enwiki-20180220-pages-articles1.xml-p10p30302.bz2
My `AnchorDictionary.csv` comes out blank. Should it be blank? I think it should, but just confirming. Also, while running `python WikiDetector.py enwiki-20170520-pages-articles.xml`, I get a FileNotFoundError for the file `EntityGender.csv`. How do I solve this?
```
python WikiExtractor.py ../enwiki-20180220-pages-articles1.xml -o text
python WikiExtractor.py ../enwiki-20180220-pages-articles1.xml --links -o output
python MakeDictionary.py output/
python CheckPerson.py text
```

Then run the script WikiDetector.py.
To support the trend of posting how-to's: in the process of running the RVA implementation provided here, I encountered several issues. This post highlights them.
For testing purposes, it is convenient to try embedding algorithms on data you are an expert in. Wikipedia's Special:Export provides a way to export specific categories of knowledge, for instance ML or physics. These are the categories I used to play around with the code.
```makefile
FNAME ?= wiki.xml
DATA_DIR ?= ./data

.PHONY: all extract dict chkperson detect train rva plots clean

all: extract dict chkperson detect rva plots

extract:
	python ./wikiextractor/WikiExtractor.py -o $(DATA_DIR)/output $(DATA_DIR)/src/$(FNAME)
	python ./wikiextractor/WikiExtractor.py --links -o $(DATA_DIR)/text $(DATA_DIR)/src/$(FNAME)

dict:
	python ./MakeDictionary.py $(DATA_DIR)/text

chkperson:
	python ./CheckPerson.py $(DATA_DIR)/text
	touch gender.csv
	mv gender.csv EntityGender.csv

detect:
	cp -r $(DATA_DIR)/text $(DATA_DIR)/adapted
	python ./WikiDetector.py $(DATA_DIR)/adapted/

train:
	python ./WikiTrainer.py $(DATA_DIR)/adapted/

rva:
	python ./RVA.py $(DATA_DIR)/adapted

plots:
	python tsne.py

clean:
	rm -rf $(DATA_DIR)/{output,text,adapted}
	rm -rf ./AnchorDictionary.csv ./EntityGender.csv
	rm -rf vocab word2vec_gensim word2vec_org
	rm -rf index labels embeddings
```
4. Run `make` to: extract data, build dictionary, analyse genders, compute embeddings with RVA and produce a plot.
#### \* Remark:
There was an issue with plotting under Python 3.5; the file **tsne.py** needs to be modified accordingly.
Finally, as an example of the output, I attached an image (a sensible part of the entire plot).
![dbped-embed](https://user-images.githubusercontent.com/1962652/36972656-35de25b0-2070-11e8-98e3-f2a73ead28ea.png)
We really appreciate your exchange of how-to's! Looking forward to reading your project proposals.
I did some research on OOV embeddings. I found out that there are two kinds of "word closeness": semantic and morphological. The first is good for in-vocabulary embeddings and is well implemented in word2vec. But when one sees an unfamiliar word, one just tries to decompose it into parts; that is how morphological metrics work. Facebook implemented this approach in their FastText library. They also have a CBOW (continuous bag of words) model, which is somehow related to the RVA proposed above. There is an idea to apply the n-gram metric for DBpedia purposes. I can see two ways of doing that:
I would like to discuss these approaches and hear some criticism, to produce a more specific and technically strong proposal for GSoC. I decided to write here because, as far as I understood from the description in the DBpedia GSoC guide, one should discuss project-specific issues here until becoming an official participant. In case this is the wrong place, please correct me :)
Thank you!
Resources: fasttext, fasttext review, stackoverflow
Recent papers have shown that the spelling of unseen words is a source of auxiliary data. Apart from that, there is the possibility of using the definitions of those rare words as external data. Since these words occur rarely, we can embed their definitions with a separate LSTM-RNN; the network can thus be trained end-to-end to produce an embedding of a dictionary definition. With this, we can add the ability to deal with OOV text to the existing model. This approach seems to incorporate the semantics of an unseen resource, as opposed to using random vectors; however, we might need to fall back to random vectors when dealing with proper nouns at test time.
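As a quick sanity check of the definition idea before committing to an LSTM, one could embed an OOV resource as the average of its definition words' vectors. Everything below (function name, dimensionality, toy vocabulary) is my own illustration, not the proposed model:

```python
import numpy as np

def embed_from_definition(definition, word_vectors, dim=4):
    """Baseline for definition-based OOV embedding: average the vectors of
    the in-vocabulary words in the definition. An LSTM-RNN, as proposed,
    would replace this averaging with a learned, order-aware composition."""
    known = [word_vectors[t] for t in definition.lower().split() if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy word vectors (random stand-ins for pre-trained embeddings):
rng = np.random.default_rng(1)
word_vectors = {w: rng.standard_normal(4) for w in ["a", "large", "grey", "mammal"]}
elephant = embed_from_definition("A large grey mammal", word_vectors)
```

If this averaging baseline already places OOV resources sensibly, the learned LSTM composition is the natural next step; if not, the definitions themselves may carry too little signal.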
Feedback from the community is highly valuable, so I would like to discuss this with the mentors as well as other community members. Thank You
The GSoC 2018 student applications are officially open! Please elaborate your proposal in a Google doc. When you're done, share it with my username at gmail.com, so I can invite also the other mentors. Deadline: March 27.
@vindex10 and @tramplingWillow: I think the use of n-grams could help, but what would happen if an n-gram does not belong to the training set? What are the chances that all possible n-grams will figure there?
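One way to put a number on that question is to measure what fraction of an unseen word's n-grams actually occur in the training vocabulary; the toy vocabulary and n-gram range below are my own assumptions for illustration:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Distinct character n-grams of a word, with boundary markers < and >."""
    padded = "<" + word + ">"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

def ngram_coverage(word, train_vocab):
    """Fraction of `word`'s n-grams that occur somewhere in the training vocabulary."""
    train_ngrams = set()
    for w in train_vocab:
        train_ngrams |= char_ngrams(w)
    ngrams = char_ngrams(word)
    return len(ngrams & train_ngrams) / len(ngrams)

train = ["embedding", "knowledge", "graph", "resource"]
coverage = ngram_coverage("embeddings", train)  # high for a morphological relative, but below 1.0
```

Coverage will be far below 1.0 for truly novel strings, which is presumably why FastText additionally hashes n-grams into a fixed number of buckets so that every n-gram maps to some vector.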
Anyhow, while I also find the discussion with the community valuable, I would rather suggest you concentrate on your own proposals in these last 11 days. 🙂
Only 6 days to go!
Please share your document with us now, if you would like to have some feedback from the mentors before the final submission to the GSoC console.
Description
A DBpedia Knowledge Base embedding (KB embedding) is a learned high-dimensional representation of a KB symbol. In less cryptic words, we want to build a vector of D dimensions (e.g. D=300) to represent each URI. In this project, we aim at adding code such that the DBpedia extraction framework outputs one real-valued vector of D dimensions for each DBpedia instance, class, and property, such that if two resources are similar in meaning, their vectors are also close to each other in vector space. These vectors are learned from DBpedia's graph itself (plus Wikipedia text), and evaluations can be done with held-out data from DBpedia or Wikipedia, or from existing evaluation datasets. During last year's GSoC, students implemented a novel algorithm for scalable KB embedding and found that one existing algorithm can scale to the size of DBpedia. However, these approaches have not yet been tested on out-of-vocabulary (OOV) resources, that is, on the ability to return a vector for resources that did not belong to the training set.

![500-dimensional RVA embeddings](https://akshayjagatap.files.wordpress.com/2017/09/500.png?w=800)
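The "close in vector space" criterion is typically scored with cosine similarity; here is a minimal sketch, where the D=300 vectors for two hypothetical, related resources are made-up stand-ins for learned embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

D = 300
rng = np.random.default_rng(42)
berlin = rng.standard_normal(D)
# A "similar" resource: mostly the same direction plus a little noise.
germany = berlin + 0.1 * rng.standard_normal(D)
unrelated = rng.standard_normal(D)

sim_close = cosine(berlin, germany)   # near 1 for similar resources
sim_far = cosine(berlin, unrelated)   # near 0 for unrelated ones in high dimensions
```

A well-trained KB embedding should reproduce this pattern on real pairs such as closely linked DBpedia resources versus randomly sampled ones.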
Goals
Devise or adapt a model for KB embedding which can deal with out-of-vocabulary resources.
Impact
Embeddings are widely used in NLP as a way to encode distributional semantics. Distributed representations of DBpedia resources will allow people to use semantic similarity to help with entity linking, relationship extraction, etc. They may also be used to extend type coverage, add missing links, and so on, making DBpedia more complete as a KB.
Warm up tasks
Mentors
Peng Xu, Thiago Galery, Tommaso Soru
Keywords
knowledge graph embedding, vector space model, distributional semantics