Closed by alexanderpanchenko 8 years ago
Here are some details regarding the Extraction part.
The graphs are on spartux: /home/panchenko/ner-graphs.tgz.
Copy them into the Graph folder of the English directory.
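The two steps above could look like the following shell sketch (the archive name comes from the path mentioned here; the assumption that the archive unpacks into a `ner-graphs/` directory next to the `English` directory is mine, so adjust to your actual layout):

```shell
# Sketch, assuming ner-graphs.tgz unpacks into a ner-graphs/ directory
# and that the English directory sits in the current working directory.
tar -xzf ner-graphs.tgz            # unpack the graphs archive
cp -r ner-graphs/* English/Graph/  # copy the graphs into English/Graph
```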
To extract Named Entities with the graph, you need to:
I've uploaded the new version of the grammars to spartux (the whole English directory, with dictionaries etc.):
/home/panchenko/English.tgz
Let me know if it works!
Olga's graphs, with sources, are at:
/home/panchenko/hypernymy.tgz
The corpus of Wikipedia abstracts is available at:
http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv
It is a single 5 GB text file (~1 billion tokens).
The corpus of web pages is available at:
http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/ukwac-surface.csv
It is a single 12 GB text file (~2 billion tokens).
You can obtain the full corpus by concatenating the two files:

```shell
cat wacky-surface.csv ukwac-surface.csv > surface.csv
```
I suggest starting the tests with the first 500 MB of the Wikipedia corpus.
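One minimal way to cut such a test sample, assuming GNU `head` (the output file name is made up):

```shell
# Take roughly the first 500 MB of the Wikipedia abstracts corpus.
# head -c may split the last line in the middle; that is harmless
# for a quick test run.
head -c 500M wacky-surface.csv > wacky-sample.csv
```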
well done!
Motivation
Currently, patternsim deals only with single words or compound nouns in the dictionary. The goal is to improve the extraction of relations between named entities, so that "New York" is no longer stemmed to "york" and "San Francisco" is no longer stemmed to "francisco", as happens now.
Implementation
Create a branch of Patternsim which deals with NEs.
Parsing
Extraction
The program should call Unitex graphs in a different way:
I propose starting with the Parsing features.