A tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns.
LGPLv3: http://www.gnu.de/documents/lgpl-3.0.en.html
A tool for extracting raw extraction counts with lexico-syntactic patterns.
Requirements
Installation on Ubuntu 12.04
Quick Start
Use ./rerank.sh to rerank relations with the default formula. The script also serves as a usage example for patternsim-rank.
Synopsis
patternsim [options] [corpus_file(s) ...]
Options
Usage:
patternsim [options] [corpus_file(s) ...]
Mandatory options:
--unitex Unitex main directory
--output (-o) output directory
Options:
--vocabulary (-v) input vocabulary file
--workers (-w) number of workers
--language (-l) language
--list-languages list all available languages
--verbose verbose mode
--help brief help message
--man full documentation
Options:
--unitex *unitex_main_directory*
Specify the Unitex main directory if you want to use your own
Unitex installation (overrides the patternsim configuration file)
--output -o *output_directory*
Specify the output directory.
--vocabulary --vocab -v *vocabulary_file*
Specify the UTF-8 input vocabulary file (one word per line)
--workers -w *number_of_workers*
Specify the number of parallel workers. Workers extract semantic
relations in parallel. A good number of workers is the number of
CPU cores minus 1.
--language -l *language_id*
Specify the current language
--list-languages
Show all available languages (language_id and full name)
--verbose
Explains what is being done
--help -h
Prints a brief help message and exits.
--man Prints the manual page and exits.
--verbose
Activates the verbose mode. Explains all the processes. Outputs
will be shown on stderr.
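The options above can be assembled programmatically. The following is a minimal Python sketch, not part of patternsim itself: the helper name `build_patternsim_command` is hypothetical, and it only uses flags documented above. When no worker count is given, it applies the "CPU cores minus 1" rule of thumb.

```python
import os


def build_patternsim_command(unitex_dir, output_dir, corpus_files,
                             vocabulary=None, workers=None):
    """Build an argument list for patternsim (hypothetical helper)."""
    if workers is None:
        # Rule of thumb from the --workers description: CPU cores minus 1,
        # but never fewer than one worker. os.cpu_count() may return None.
        workers = max(1, (os.cpu_count() or 2) - 1)
    cmd = ["./patternsim",
           "--unitex", unitex_dir,
           "--output", output_dir,
           "--workers", str(workers)]
    if vocabulary is not None:
        cmd += ["--vocabulary", vocabulary]
    # Corpus files are positional arguments and come last.
    return cmd + list(corpus_files)
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the patternsim script.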
Example
./patternsim --unitex /home/user/Unitex3.0beta -v vocabulary.txt -o output corpus.txt
This command writes a set of files to the directory "./output":
The files conc-freq.csv and corpus-freq.csv are CSV files in the following format:
word;frequency\n
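As an illustration of the word;frequency format, here is a minimal Python sketch that loads such a file into a dictionary. The function name `read_frequencies` is hypothetical, and the sketch assumes that frequencies are integers and that the file is UTF-8 encoded. Lines that do not match the format are skipped.

```python
import csv


def read_frequencies(path):
    """Read a 'word;frequency' CSV file into a dict (hypothetical helper)."""
    freqs = {}
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters inside words as ordinary text.
        reader = csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 2:
                continue  # skip empty or malformed lines
            word, count = row
            try:
                freqs[word] = int(count)
            except ValueError:
                continue  # skip lines whose second field is not a number
    return freqs
```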
The files pairs.csv, pairs-np.csv and pairs-voc.csv are CSV files in the following format:
target-word;relatum-word;e-syno;e-cohypo;e-hyper-hypo;e-hyper;e-hypo;e-all;e1;e2;e3;e4;e5;e6;e7;e8;e9;e10;e11;e12;e13;e14;e15;e16;e17\n
Here target-word and relatum-word are words, e-all is the number of extractions between target-word and relatum-word with all 17 patterns, and ei is the number of extractions between target-word and relatum-word with the i-th pattern (see the paper referenced above for details). Thus e-all = sum_i (ei).
e-syno, e-cohypo, e-hyper-hypo, e-hyper and e-hypo are the numbers of extractions of specific relation types between the two terms (synonyms, co-hyponyms, hypernyms+hyponyms, hypernyms and hyponyms, respectively).
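The pairs format and the identity e-all = sum_i (ei) can be checked with a short Python sketch. The names `parse_pair_line` and `is_consistent` are hypothetical and not part of patternsim. The sketch assumes every count is an integer and that each line has exactly 25 semicolon-separated fields (2 words, 5 relation-type counts, e-all and 17 per-pattern counts).

```python
PATTERN_COUNT = 17
TYPE_FIELDS = ["e-syno", "e-cohypo", "e-hyper-hypo", "e-hyper", "e-hypo"]


def parse_pair_line(line):
    """Split one line of pairs.csv into named parts (hypothetical helper)."""
    fields = line.rstrip("\n").split(";")
    expected = 2 + len(TYPE_FIELDS) + 1 + PATTERN_COUNT
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    counts = [int(value) for value in fields[2:]]
    n_types = len(TYPE_FIELDS)
    return {
        "target": fields[0],
        "relatum": fields[1],
        "by_type": dict(zip(TYPE_FIELDS, counts[:n_types])),
        "e_all": counts[n_types],
        "per_pattern": counts[n_types + 1:],
    }


def is_consistent(record):
    """Check that e-all equals the sum of the per-pattern counts."""
    return record["e_all"] == sum(record["per_pattern"])
```

A line for which `is_consistent` returns False most likely indicates a truncated or corrupted file.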
Corpus
Here are some corpora which you may use with this tool:
Russian morphological dictionary
The Russian dictionary in this repository is an extract of the Russian computational morphological dictionary developed at CIS, Munich. This extract contains about 15% of the original dictionary (the most frequent lemmata). The whole dictionary actually contains 140,000 simple entries (= 2.7 million distinct forms), 166,000 simple proper nouns (= 900,000 distinct forms) and 1800 compound words.
If you want to use the full version of the lexicon, please contact:
Sebastian Nagel
CIS
Oettingenstr. 67
80538 München
Germany
wastl@cis.uni-muenchen.de
http://www.cis.uni-muenchen.de
For additional information see:
Nagel, Sebastian 2002: Formenbildung im Russischen. Formale Beschreibung und Automatisierung für das CISLEX-Wörterbuchsystem (http://www.cis.uni-muenchen.de/~wastl/pub/ruslex.pdf)
For a short description (in German), see http://www.cis.uni-muenchen.de/~wastl/pub/ruslexUnitex.pdf
Reranks semantic similarity scores between words extracted with patternsim. Located in the "rank" directory.
Synopsis
patternsim-rank [options]
System Requirements
Binaries
Binaries are readily available in the bin folder. On Unix-based systems you may use "./patternsim-rank" or "./patternsim-rank.exe". On Windows, use "patternsim-rank.exe".
Testing
Recompilation
Options
p, pairs
Required. A UTF-8 encoded CSV file produced by the PatternSim program, in the format:
target;relatum;syno;cohypo;hyper_hypo;hyper;hypo;sum;pattern;pattern2;pattern3;pattern4;pattern5;pattern6;pattern7;pattern8;pattern9;pattern10;pattern11;pattern12;pattern13;pattern14;pattern15;pattern16;pattern17
This file must contain symmetric relations between words (PatternSim generates them by default). If there exists a relation 'target;relatum;type;sim', then there should exist one and only one relation 'relatum;target;type;sim' in the same file.
o, output
Required. A UTF-8 encoded CSV file 'target;relatum;sim', where 'sim' is the similarity score between 'target' and 'relatum'. This file is sorted by 'target' and then by 'sim'.
c, corpusfreq
Required. A UTF-8 encoded CSV file 'word;freq' with frequencies of words.
t, type
Required. Type of reranking:
a, alpha
Expected number of relations per word, default -- 15.
b, beta
Minimum number of extractions which establish a relation between words, default -- 2.
s, sqrt
Sqrt of the number of different patterns, default -- true.
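The symmetry requirement on the pairs file can be verified before reranking. The following Python sketch is an assumption about how such a check might look, not something shipped with patternsim-rank, and the name `find_asymmetric_pairs` is hypothetical. It only checks that each (target, relatum) pair occurs once and that its reverse also occurs exactly once. It does not compare the counts stored on the two lines.

```python
from collections import Counter


def find_asymmetric_pairs(rows):
    """Return pairs that violate the symmetry requirement (hypothetical helper).

    rows is an iterable of (target, relatum) tuples, e.g. the first two
    fields of every line in the pairs file.
    """
    counts = Counter((target, relatum) for target, relatum in rows)
    problems = []
    for (target, relatum), n in counts.items():
        reverse_n = counts.get((relatum, target), 0)
        # A pair is valid only if it and its reverse each occur exactly once.
        if n != 1 or reverse_n != 1:
            problems.append((target, relatum))
    return sorted(problems)
```

An empty result means that every relation in the input has exactly one reverse counterpart.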