cental / PatternSim

A tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns.

NER support #4

Closed alexanderpanchenko closed 8 years ago

alexanderpanchenko commented 11 years ago

Motivation

Currently, patternsim deals only with single words or compound nouns from the dictionary. The goal is to improve the extraction of relations between named entities, so that "New York" is not stemmed to "york" and "San Francisco" is not stemmed to "francisco", as happens now.

Implementation

Create a branch of PatternSim that deals with NEs.

Parsing

  1. The program should be able to parse files annotated by the NE recognizer and the relation extractor.
  2. The program outputs are exactly the same as those of the main branch of patternsim.
  3. However, the program now preserves the case of named entities (but not of common words!). So entities such as "San Francisco" or "New York" are not transformed into "san francisco" and "new york".
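A minimal sketch of the intended case handling, assuming the NE recognizer marks entities with Unitex-style brace annotations such as `{New York,.N+NE}` (the exact tag is an assumption, not the real grammar output):

```python
import re

# Assumed annotation format: "{entity text,.TAG}" as produced by a Unitex
# graph applied in merge mode. Adjust the pattern to the real grammar output.
NE = re.compile(r"\{([^,}]+),\.[^}]*\}")

def normalize(line):
    """Lowercase common words but keep the case of annotated named entities."""
    out, pos = [], 0
    for m in NE.finditer(line):
        out.append(line[pos:m.start()].lower())  # common words -> lowercase
        out.append(m.group(1))                   # entity text kept verbatim
        pos = m.end()
    out.append(line[pos:].lower())
    return "".join(out)

print(normalize("He moved to {New York,.N+NE} last Spring"))
# -> "he moved to New York last spring"
```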

Extraction

The program should call Unitex graphs in a different way:

  1. The program applies the named_entities.fst2 graph in merge mode.
  2. The annotated text is saved.
  3. The relation extraction graph hypernym_main.fst2 is applied in merge mode.
  4. The program parses the concord.ind file produced by the relation extraction (see above).
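The concord.ind parsing in step 4 might look like the sketch below. The assumed line layout (a mode header line, then `<start> <end> <matched text>` per match) is a simplification; the real offset format depends on the Unitex version, so the splitting should be adapted accordingly:

```python
def parse_concord(lines):
    """Parse concordance index lines of the form '<start> <end> <match>'.

    Sketch only: the real concord.ind carries a header line and dotted
    token offsets whose exact shape varies across Unitex versions.
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip the mode header line
            continue
        start, end, text = line.split(None, 2)  # text may contain spaces
        matches.append((start, end, text))
    return matches

sample = ["#M", "12.0.0 13.4.0 hypernym(fruit,apple)"]
print(parse_concord(sample))
```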

I propose starting with the Parsing features.

alexanderpanchenko commented 11 years ago

Here are some details regarding the Extraction part.

The graphs are on spartux: /home/panchenko/ner-graphs.tgz.

Copy them into the Graphs folder of the English directory.

To extract Named Entities with the graph you need to:

  1. Open the input text.
  2. Preprocess it in a standard way.
  3. Apply the Named Entity graph (locate pattern, merge mode, longest match, Graphs/named_entities/named_entities.fst2).
  4. Save the text in merge mode with the NE annotations.
  5. Close the current text.
  6. Open the text with NE annotations.
  7. Apply the relation extraction graph (locate pattern, merge mode, longest match, Graphs/hypernymy/hypernym_main.fst2).
  8. Save the concordance of the second extraction.
  9. Delete temporary files related to the first (3) and the second (7) extractions.
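The two-pass flow above could be driven by something like the following sketch. The `apply_graph` function is a stub standing in for whatever mechanism runs an .fst2 graph in merge mode (the Unitex GUI steps above, or the console tools); only the ordering of the passes and the cleanup of step 9 are taken from the procedure itself:

```python
import os

def apply_graph(graph_fst2, text_in, text_out):
    """Stub for 'apply this .fst2 graph in merge mode and save the text'.

    In the real pipeline Unitex performs this step; the plain copy below
    only stands in so the two-pass flow can be shown end to end.
    """
    with open(text_in) as src, open(text_out, "w") as dst:
        dst.write(src.read())

def extract_relations(corpus_path, out_path):
    # Steps 1-5: annotate named entities and save the merged text.
    ne_annotated = corpus_path + ".ne"
    apply_graph("Graphs/named_entities/named_entities.fst2", corpus_path, ne_annotated)
    # Steps 6-8: run relation extraction on the NE-annotated text.
    apply_graph("Graphs/hypernymy/hypernym_main.fst2", ne_annotated, out_path)
    # Step 9: delete the temporary file of the first pass.
    os.remove(ne_annotated)
```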
alexanderpanchenko commented 11 years ago

I've uploaded the new version of the grammars to spartux (the whole English directory, with dictionaries etc.):

/home/panchenko/English.tgz

Let me know if it works!

alexanderpanchenko commented 11 years ago

The graphs of Olga with sources are at

/home/panchenko/hypernymy.tgz
alexanderpanchenko commented 11 years ago

The corpus of Wikipedia abstracts is available at:

http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv

It is a single text file of 5 GB (~1 billion tokens).

The corpus of web pages is available at:

http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/ukwac-surface.csv

It is a single text file of 12 GB (~2 billion tokens).

You can obtain the full corpus by concatenating the two files:

cat wacky-surface.csv ukwac-surface.csv > surface.csv 

I suggest starting the tests with the first 500 MB of the Wikipedia corpus.
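Taking that 500 MB slice could be done with a byte-level head such as the sketch below (the helper name is mine; the file name is the one from the post):

```python
def head_bytes(src, dst, n_bytes=500 * 1024 ** 2, chunk=1 << 20):
    """Copy the first n_bytes of src to dst, reading 1 MB at a time."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while copied < n_bytes:
            buf = fin.read(min(chunk, n_bytes - copied))
            if not buf:  # source shorter than the requested slice
                break
            fout.write(buf)
            copied += len(buf)

# head_bytes("wacky-surface.csv", "wacky-500mb.csv")
```

Slicing on a byte boundary may cut the last line in half, which is harmless for a smoke test but worth remembering.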

alexanderpanchenko commented 8 years ago

well done!