dbpedia / GSoC

Google Summer of Code organization

DBpedia Embeddings for Out-Of-Vocabulary Resources #12

Closed mommi84 closed 5 years ago

mommi84 commented 6 years ago

Description

A DBpedia Knowledge Base embedding (KB embedding) is a learned high-dimensional representation of a KB symbol. In less cryptic words, we want to build a vector of D dimensions (e.g. D=300) to represent each URI. In this project, we aim at adding code so that the DBpedia extraction framework outputs one real-valued vector of D dimensions for each DBpedia Instance, Class and Property, such that if two resources (I, C or P) are similar in meaning, then their vectors are also close to each other in vector space. These vectors are learned from DBpedia's graph itself (plus Wikipedia text), and evaluations can be done with held-out data from DBpedia or Wikipedia, or from existing evaluation datasets.

During last year's GSoC, students implemented a novel algorithm for scalable KB embedding and found that one existing algorithm can scale to the size of DBpedia. However, these approaches have not yet been tested on out-of-vocabulary (OOV) resources, that is, their ability to return a vector for resources that did not belong to the training set.

(Figure: 500-dimensional RVA embeddings)
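
To make "close in vector space" concrete, here is a minimal sketch (not part of the extraction framework; the URIs and numbers are made up) that scores the similarity of two resource vectors with cosine similarity:

    import numpy as np

    # Hypothetical toy embeddings: URI -> D-dimensional vector (D=4 here for brevity).
    embeddings = {
        "dbr:Berlin":  np.array([0.9, 0.1, 0.0, 0.2]),
        "dbr:Hamburg": np.array([0.8, 0.2, 0.1, 0.2]),
        "dbo:Person":  np.array([0.0, 0.9, 0.7, 0.1]),
    }

    def cosine(u, v):
        """Cosine similarity: close to 1.0 for resources that are similar in meaning."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(embeddings["dbr:Berlin"], embeddings["dbr:Hamburg"]))  # high
    print(cosine(embeddings["dbr:Berlin"], embeddings["dbo:Person"]))   # lower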

Goals

Devise or adapt a model for KB embedding which can deal with out-of-vocabulary resources.

Impact

Embeddings are widely used in NLP as a way to encode distributional semantics. Distributed representations of DBpedia resources will allow people to use semantic similarity to help with entity linking, relationship extraction, etc. They may be used to extend type coverage, add missing links, etc making DBpedia more complete as a KB.

Warm-up tasks

bharat-suri commented 6 years ago

Can we try to implement Recurrent Neural Network Embedding for KB completion?

mommi84 commented 6 years ago

Hi @tramplingWillow, sure. Please complete the warm-up tasks if you haven't yet and elaborate your proposal in a Google doc. When you're done, add my username at gmail.com to the doc.

mgns commented 6 years ago

Student proposal for this task: https://github.com/dbpedia/GSoC/issues/14

amanmehta-maniac commented 6 years ago

We can take a first step toward the OOV word-embedding challenge by learning vectors for character n-grams within the word and then summing them to produce the final word embedding. I have been working on similar ideas in a research project at my university as well. What do you think?
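
A rough sketch of that idea, with hypothetical data structures (the real n-gram vectors would of course come from training, not from random initialisation):

    import numpy as np

    D = 300  # embedding dimensionality

    def char_ngrams(word, n_min=3, n_max=5):
        """Character n-grams of a word, with boundary markers as in fastText."""
        w = "<" + word + ">"
        return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

    def oov_vector(word, ngram_vectors):
        """Sum the vectors of the word's known n-grams; zero vector if none are known."""
        vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
        return np.sum(vecs, axis=0) if vecs else np.zeros(D)

    # The n-gram vectors would come from training; here they are just random toy values.
    rng = np.random.default_rng(0)
    ngram_vectors = {g: rng.normal(size=D) for g in char_ngrams("Leipzig")}
    print(oov_vector("Leipzig", ngram_vectors).shape)  # (300,)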

seqryan commented 6 years ago

Error when running the RVA-based embedding algorithm:

    File "WikiDetector.py", line 132, in
      surfForms = loadSurfaceForms("AnchorDictionary.csv", 5)  # selecting the top 5 most common anchor texts
    File "WikiDetector.py", line 90, in loadSurfaceForms
      with open(filename, 'r') as output:
    FileNotFoundError: [Errno 2] No such file or directory: 'AnchorDictionary.csv'

How do I generate this file?

bharat-suri commented 6 years ago

Hi @seqryan, you need to run the script MakeDictionary.py with all the text files from the wiki dump as input. That will generate the file `AnchorDictionary.csv`. Use the output directory of WikiExtractor.py as input, since it contains all the plain-text files extracted from the XML dump.
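
For reference, a minimal sketch of what a surface-form loader could look like. The column layout assumed here (`anchor_text,entity,count`) is a guess, not the actual format produced by MakeDictionary.py or read by loadSurfaceForms, so treat it as illustration only:

    import csv
    from collections import defaultdict

    def load_surface_forms(filename, top_k=5):
        """Keep the top_k most frequent anchor texts per entity (hypothetical CSV layout)."""
        rows_by_entity = defaultdict(list)  # entity -> [(count, anchor_text), ...]
        with open(filename, newline='', encoding='utf-8') as f:
            for anchor, entity, count in csv.reader(f):
                rows_by_entity[entity].append((int(count), anchor))
        return {entity: [anchor for _, anchor in sorted(rows, reverse=True)[:top_k]]
                for entity, rows in rows_by_entity.items()}

    surface_forms = load_surface_forms("AnchorDictionary.csv", 5)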

mommi84 commented 6 years ago

Hi @amanmehta-maniac, that sounds good. Please prepare a proposal draft in a Google doc, where you explain how you intend to handle the problem. When you're done, invite my username at gmail.com to the doc.

Thank you @seqryan and @tramplingWillow for your debugging efforts!

I also added an example of a successful project proposal.

seqryan commented 6 years ago

Considering that our goal is to predict information about out-of-vocabulary resources, I have two queries:

1. What information is available about out-of-vocabulary resources (since the embeddings will use this information to infer other properties)?
2. What information do we need to predict about out-of-vocabulary resources?

amanmehta-maniac commented 6 years ago

Is there a need to download the entire dump for the warm-up tasks? Or can I get the gist using some smaller substitute dump, considering the original dump archive itself is ~14 GB, if I am not wrong? @tramplingWillow any leads?

bharat-suri commented 6 years ago

@amanmehta-maniac, no, you need not. As @mommi84 has mentioned, just a subset is needed to perform the analysis. I figure it's important that you simply familiarise yourself with the workflow of the RVA-based embedding system.

amanmehta-maniac commented 6 years ago

Cool. Which substitute dump would be good? Can you link me up with the dump you used? :)

bharat-suri commented 6 years ago

Here is the index of the latest wiki dumps: https://dumps.wikimedia.org/enwiki/latest/. I used the dump with the article names and abstracts: https://dumps.wikimedia.org/enwiki/20180220/enwiki-20180220-pages-articles1.xml-p10p30302.bz2

amanmehta-maniac commented 6 years ago

My AnchorDictionary.csv comes out blank. Should it be blank? I think it should, but just confirming. Also, while running `python WikiDetector.py enwiki-20170520-pages-articles.xml`, I get a FileNotFound error for the file EntityGender.csv. How do I solve this?

bharat-suri commented 6 years ago

Running the RVA-based embedding algorithm

  1. Use the Wikipedia dump as input to WikiExtractor.py.
  2. Clean the XML dump and extract plain text using the script WikiExtractor.py:
    python WikiExtractor.py ../enwiki-20180220-pages-articles1.xml -o text
    python WikiExtractor.py ../enwiki-20180220-pages-articles1.xml --links -o output
  3. Build the global dictionary, AnchorDictionary.csv, using the script MakeDictionary.py:
    python MakeDictionary.py output/
  4. Next up, build EntityGender.csv, the dictionary that maps gender-based pronouns with the help of dbo:Person:
    python CheckPerson.py text
  5. Then run the script WikiDetector.py.

vindex10 commented 6 years ago

To support the trend of posting how-to's:

Running RVA

In the process of running the RVA code provided here I encountered several issues. This post highlights them.

For testing purposes, it is convenient to try embedding algorithms on data you are an expert in. Special:Export on Wikipedia lets you export specific categories of knowledge, for instance ML or Physics; these are the categories I used to play around with the code.

  1. Download the latest version of WikiExtractor (for me, the one present in the GitHub repo didn't create tags). When cloning, initialise the git submodule so that it appears at "gsoc2017-akshay/wikiextractor".
  2. To use the Makefile provided below, put your dump at "gsoc2017-akshay/data/src/wiki.xml".
  3. Put the following Makefile at "gsoc2017-akshay/Makefile":
    
    FNAME?=wiki.xml
    DATA_DIR?=./data

    .PHONY: all extract dict chkperson detect train rva plots clean

    all: extract dict chkperson detect rva plots

    extract:
    	python ./wikiextractor/WikiExtractor.py -o $(DATA_DIR)/output $(DATA_DIR)/src/$(FNAME)
    	python ./wikiextractor/WikiExtractor.py --links -o $(DATA_DIR)/text $(DATA_DIR)/src/$(FNAME)

    dict:
    	python ./MakeDictionary.py $(DATA_DIR)/text

    chkperson:
    	python ./CheckPerson.py $(DATA_DIR)/text
    	# this is to fix the FileNotFound error
    	touch gender.csv
    	mv gender.csv EntityGender.csv

    detect:
    	cp -r $(DATA_DIR)/text $(DATA_DIR)/adapted
    	python ./WikiDetector.py $(DATA_DIR)/adapted/

    train:
    	python ./WikiTrainer.py $(DATA_DIR)/adapted/

    rva:
    	python ./RVA.py $(DATA_DIR)/adapted

    plots:
    	python tsne.py

    clean:
    	rm -rf $(DATA_DIR)/{output,text,adapted}
    	rm -rf ./AnchorDictionary.csv
    	rm -rf ./EntityGender.csv
    	rm -rf vocab word2vec_gensim word2vec_org
    	rm -rf index labels embeddings

4. Run `make` to: extract data, build dictionary, analyse genders, compute embeddings with RVA and produce a plot.

#### \* Remark:
There was an issue with plotting under Python 3.5; the file **tsne.py** needs to be modified accordingly.
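
On top of that, a quick sanity check of the trained vectors. This is a minimal sketch assuming WikiTrainer.py saves a standard gensim Word2Vec model to a file named word2vec_gensim (the filename that appears in the `clean` target above); if your copy saves a different format, adapt the loading call:

    from gensim.models import Word2Vec

    # Assumes the training step produced a gensim model file named "word2vec_gensim".
    model = Word2Vec.load("word2vec_gensim")

    # Nearest neighbours of a token you know occurs in your corpus.
    print(model.wv.most_similar("physics", topn=5))
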
mommi84 commented 6 years ago

We really appreciate your exchange of how-to's! Looking forward to reading your project proposals.

vindex10 commented 6 years ago

I did some research on OOV embeddings. I found that there are two kinds of "word closeness": semantic and morphological. The first works well for in-vocabulary embeddings and is well covered by word2vec. But when one sees an unfamiliar word, one tries to decompose it into parts; that is how morphological metrics work. Facebook implemented this approach in their fastText library. They also have a CBOW (continuous bag-of-words) model, which is somehow related to the RVA proposed above. The idea is to apply an n-gram metric for DBpedia purposes. I can see two ways of doing that:

I would like to discuss these approaches and hear some criticism, in order to produce a more specific and technically strong proposal for GSoC. I decided to write here because, as far as I understood from the description in the DBpedia GSoC guide, one should discuss project-specific issues here until becoming an official participant. In case this is the wrong place, please correct me :)

Thank you!

Resources: fastText, fastText review, Stack Overflow
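
As a minimal illustration of the subword behaviour (not of either proposal variant), here is a gensim sketch (gensim >= 4.0; toy corpus, not DBpedia data) showing how an OOV token still gets a vector from its character n-grams:

    from gensim.models import FastText

    # Toy corpus of tokenised sentences (placeholder data, just to show the API).
    corpus = [
        ["berlin", "is", "the", "capital", "of", "germany"],
        ["hamburg", "is", "a", "city", "in", "germany"],
        ["physics", "studies", "matter", "and", "energy"],
    ]

    model = FastText(sentences=corpus, vector_size=32, window=3, min_count=1,
                     epochs=50, min_n=3, max_n=5)

    # "heidelberg" never occurs in the corpus, but fastText composes a vector
    # from its character n-grams, so the lookup still succeeds.
    print("heidelberg" in model.wv.key_to_index)  # False: out of vocabulary
    print(model.wv["heidelberg"][:5])             # ...and yet we get a vector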

bharat-suri commented 6 years ago

Recent papers have shown that the spelling of unseen words is a source of auxiliary data. Apart from that, there is the possibility of using the definitions of those rare words as external data. Since these words are rare in occurrence, we can embed their definitions with a separate LSTM-RNN, so the network is trained to produce the embedding of a dictionary definition in an end-to-end fashion. With this, we can add the ability to deal with OOV text to the existing model. This approach incorporates the semantics of an unseen resource, as opposed to using random vectors; however, we might need to fall back to random vectors when dealing with proper nouns at test time.
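
A minimal PyTorch sketch of such a definition encoder (hypothetical vocabulary and token ids, no training loop): the final LSTM hidden state serves as the embedding of the defined word.

    import torch
    import torch.nn as nn

    class DefinitionEncoder(nn.Module):
        """Encode a tokenised dictionary definition into a single D-dimensional vector."""
        def __init__(self, vocab_size, emb_dim=100, out_dim=300):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.lstm = nn.LSTM(emb_dim, out_dim, batch_first=True)

        def forward(self, token_ids):          # token_ids: (batch, seq_len)
            x = self.embed(token_ids)          # (batch, seq_len, emb_dim)
            _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, out_dim)
            return h_n.squeeze(0)              # (batch, out_dim) = definition embedding

    # Toy usage with made-up token ids standing in for "a city in Germany".
    encoder = DefinitionEncoder(vocab_size=1000)
    definition = torch.tensor([[12, 45, 7, 301]])  # (1, 4)
    print(encoder(definition).shape)               # torch.Size([1, 300])

During training, the encoder output would be regressed onto the known embedding of the defined word, so at test time it can produce vectors for OOV resources from their definitions alone.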

Feedback from the community is highly valuable, so I would like to discuss this with the mentors as well as other community members. Thank You

mommi84 commented 6 years ago

The GSoC 2018 student applications are officially open! Please elaborate your proposal in a Google doc. When you're done, share it with my username at gmail.com, so that I can also invite the other mentors. Deadline: March 27.

@vindex10 and @tramplingWillow: I think the use of n-grams could help, but what would happen if an n-gram does not belong to the training set? What are the chances that all possible n-grams will appear there?
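
To make the coverage question concrete, a tiny sketch (toy vocabulary, made-up labels) that checks which fraction of an unseen label's character n-grams was actually observed during training:

    def char_ngrams(word, n_min=3, n_max=5):
        w = "<" + word + ">"
        return {w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)}

    # Toy "trained" n-gram vocabulary built from two in-vocabulary labels.
    trained = char_ngrams("Berlin") | char_ngrams("Hamburg")

    oov = char_ngrams("Bergen")
    covered = oov & trained
    print(f"{len(covered)}/{len(oov)} n-grams of 'Bergen' were seen during training")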

Anyhow, while I also find the discussion with the community valuable, I would rather suggest you concentrate on your own proposals in these last 11 days. 🙂

mommi84 commented 6 years ago

Only 6 days to go!

Please share your document with us now, if you would like to have some feedback from the mentors before the final submission to the GSoC console.