HamedBabaei / LLMs4OL

LLMs4OL: Large Language Models for Ontology Learning
MIT License

Language Models as a Knowledge base - experimentation for our datasets #2

Closed HamedBabaei closed 1 year ago

HamedBabaei commented 1 year ago

It seems that they aligned the facts with Wikipedia text to obtain a sentence containing the specific subject or object entity. Then, to predict an object entity, they used [MASK] tokens. For ConceptNet, they used sentences from its own source dataset. They then created query templates for the relations (in our task I should create them for entity types as well) to query the LMs.

According to their work, for us the input should consist of the aligned text plus the query template. For example, for WordNet we can obtain an example sentence for any synset: for the entry "cover" we can get the sentence "cover the child with a blanket", and appending the template at the end gives: "cover the child with a blanket. cover word type is a [MASK]" (or any other template that looks like this -- this is just an example), where the [MASK] is 'verb'. This is only an idea, but the first step is to test this paper's idea; see the sketch below.
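To make the idea concrete, here is a minimal sketch (not the authors' code) of single-[MASK] probing with a generic masked LM via the HuggingFace fill-mask pipeline; the checkpoint name and the template are assumptions for illustration only:

```python
from transformers import pipeline

# Assumed checkpoint; any masked LM would do.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Aligned WordNet example sentence plus a hypothetical type-probing template.
prompt = "cover the child with a blanket. cover word type is a [MASK]."

# Print the top-ranked fillers for the [MASK] slot with their probabilities.
for candidate in fill_mask(prompt, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 4))
```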

Most of the datasets that didn't have sentences from their own sources relied on Wikipedia! I also had a quick look at their code: they only used the embeddings and vocabulary obtained from each LM to compute a probability for every token, then picked the top-ranked ones and evaluated the results with ranking (search-engine style) metrics.
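For reference, a rough sketch of that kind of scoring (model name and prompt are placeholders, not taken from their code): take the model's distribution over its vocabulary at the [MASK] position and rank tokens by probability.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

sentence = "cover the child with a blanket. cover word type is a [MASK]."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Locate the [MASK] position and turn its logits into a probability distribution.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_pos].softmax(dim=-1)

# Rank vocabulary tokens by probability and inspect the top candidates.
top = probs.topk(10, dim=-1)
print(tokenizer.convert_ids_to_tokens(top.indices[0].tolist()))
```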

Now, for the entity type detection task, let's:

HamedBabaei commented 1 year ago

Short summary of the paper:

1. Chosen models:


2. Knowledge Sources: They transform the fact triples from each source into cloze templates, and check to what extent aligned texts that are known to express a particular fact exist in Wikipedia.

According to the code, they ONLY considered these cloze templates for the Google-RE dataset:

  • Google-RE corpus: three relations have been considered, with the following templates:

    "relation": "place_of_birth", "template": "[X] was born in [Y] .", "template_negated": "[X] was not born in [Y] ."

    "relation": "date_of_birth", "template": "[X] (born [Y]).", "template_negated": "[X] (not born [Y])."

    "relation": "place_of_death", "template": "[X] died in [Y] .", "template_negated": "[X] did not die in [Y] ."

    Samples from this dataset are aligned with Wikipedia text that supports the fact, e.g.:

{"pred": "/people/person/date_of_birth", "sub": "/m/09gb0bw", "obj": "1941", "evidences": [{"url": "http://en.wikipedia.org/wiki/Peter_F._Martin", "snippet": "Peter F. Martin (born 1941) is an American politician who is a Democratic member of the Rhode Island House of Representatives. He has represented the 75th District Newport since 6 January 2009. He is currently serves on the House Committees on Judiciary, Municipal Government, and Veteran's Affairs. During his first term of office he served on the House Committees on Small Business and Separation of Powers & Government Oversight. In August 2010, Representative Martin was appointed as a Commissioner on the Atlantic States Marine Fisheries Commission", "considered_sentences": ["Peter F Martin (born 1941) is an American politician who is a Democratic member of the Rhode Island House of Representatives ."]}], "judgments": [{"rater": "18349444711114572460", "judgment": "yes"}, {"rater": "17595829233063766365", "judgment": "yes"}, {"rater": "4593294093459651288", "judgment": "yes"}, {"rater": "7387074196865291426", "judgment": "yes"}, {"rater": "17154471385681223613", "judgment": "yes"}], "sub_w": null, "sub_label": "Peter F. Martin", "sub_aliases": [], "obj_w": null, "obj_label": "1941", "obj_aliases": [], "uuid": "18af2dac-21d3-4c42-aff5-c247f245e203", "masked_sentences": ["Peter F Martin (born [MASK]) is an American politician who is a Democratic member of the Rhode Island House of Representatives ."]}


3. Evaluation Metric: Mean Precision at K (P@K). Here K=1.
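As a quick illustration (not the paper's code), mean P@1 just checks whether the top-ranked token equals the gold token for each query and averages over all queries; the prediction/gold pairs below are hypothetical.

```python
def precision_at_k(ranked_tokens, gold_token, k=1):
    # 1.0 if the gold token appears in the top-k ranked candidates, else 0.0.
    return 1.0 if gold_token in ranked_tokens[:k] else 0.0

queries = [
    (["verb", "noun", "adjective"], "verb"),     # hypothetical ranked predictions, gold
    (["city", "country", "river"], "country"),
]

mean_p_at_1 = sum(precision_at_k(preds, gold, k=1) for preds, gold in queries) / len(queries)
print(mean_p_at_1)  # 0.5 for this toy example
```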


4. Considerations:

  1. Manually defined templates: this means they are measuring a lower bound on what LMs know.
  2. Single token: multi-token decoding would add a number of additional tunable parameters.
  3. Object slots: using the reverse relation they are able to query subjects as well. There is no relation-slot query because: (1) relations are usually multi-token phrases, and (2) it is unclear what the gold-standard pattern for a relation would be.
  4. Intersection of vocabularies: ELMo vocab size: 800k, BERT vocab size: 30k. The larger the vocabulary, the harder it is to rank the gold token at the top, so they limit the candidate set to a ~21k-token common vocabulary (see the sketch after this list).
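A rough sketch of the vocabulary-intersection idea, assuming two HuggingFace tokenizers stand in for the probed models (the paper intersects more models, and real pipelines also normalize subword markers before comparing tokens):

```python
from transformers import AutoTokenizer

# Assumed checkpoints standing in for the probed models.
tokenizers = [
    AutoTokenizer.from_pretrained("bert-base-cased"),
    AutoTokenizer.from_pretrained("roberta-base"),
]

# Keep only tokens that appear in every model's vocabulary.
# NOTE: a real pipeline would first normalize subword markers ("##", "Ġ", ...)
# so that surface forms are comparable across tokenizers.
common_vocab = set(tokenizers[0].get_vocab())
for tok in tokenizers[1:]:
    common_vocab &= set(tok.get_vocab())

print(len(common_vocab))  # ranking is then restricted to this shared candidate set
```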

5. Results:

HamedBabaei commented 1 year ago

The weakness of this paper for us: some of our entities require masking more than one token. For this reason, in order to test this paper's idea, we move forward with level-1 entities in the GeoNames and UMLS datasets.

To solve this issue, we will move forward with BART!
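As a starting point, here is a minimal sketch of multi-token mask filling with BART via HuggingFace (the checkpoint and the GeoNames-style prompt are placeholders, not part of this issue); since BART is a seq2seq denoiser, a single `<mask>` can be filled with a span of several tokens.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

# Assumed checkpoint; prompt is a hypothetical GeoNames-style type query.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

text = "Kuala Lumpur geographically is a <mask>."
inputs = tokenizer(text, return_tensors="pt")

# Generate the full sentence; the <mask> may be replaced by multiple tokens.
with torch.no_grad():
    generated = model.generate(inputs["input_ids"], num_beams=5, max_length=24)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```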