cental / PatternSim

A tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns.

PatternSim

License

LGPLv3: http://www.gnu.de/documents/lgpl-3.0.en.html

patternsim

A tool for extracting raw extraction counts using lexico-syntactic patterns.

Requirements

Installation on Ubuntu 12.04

  1. Install Unitex 3.0beta (http://www-igm.univ-mlv.fr/~unitex/zips/Unitex3.0beta.zip)
  2. Install cpanm: "sudo cpan App::cpanminus"
  3. Install all dependencies: "sudo cpanm --installdeps ."

Quick Start

Run ./rerank.sh to rerank relations with the default formula. The script also serves as a usage example for patternsim-rank.

Synopsis

patternsim [options] [corpus_file(s) ...]

Options

Usage:
patternsim [options] [corpus_file(s) ...]

  Mandatory options:
    --unitex                 Unitex main directory
    --output (-o)            output directory

  Options:
    --vocabulary (-v)        input vocabulary file
    --workers (-w)           number of workers
    --language (-l)          language

    --list-languages         list all available languages

    --verbose                verbose mode
    --help                   brief help message
    --man                    full documentation

Options:
--unitex *unitex_main_directory*
        Specify the Unitex main directory if you want to use your own
        Unitex installation (overrides the patternsim configuration file)

--output -o *output_directory*
        Specify the output directory.

--vocabulary --vocab -v *vocabulary_file*
        Specify the UTF-8 input vocabulary file (one word per line)

--workers -w *number_of_workers*
        Specify the number of parallel workers. Workers extract
        semantic relations in parallel. A good choice is the number
        of CPU cores minus 1.

--language -l *language_id*
        Specify the current language

--list-languages
        Show all available languages (language_id and full name)

--verbose
        Activates the verbose mode and explains what is being done.
        Output is shown on stderr.

--help -h
        Prints a brief help message and exits.

--man   Prints the manual page and exits.

Example

./patternsim --unitex /home/user/Unitex3.0beta -v vocabulary.txt -o output corpus.txt
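
When patternsim is launched from a script, the argument list can be assembled programmatically. The following is a minimal sketch, assuming only the flags documented above; the helper name build_patternsim_command is hypothetical and not part of the tool. It applies the suggested rule of thumb for the number of workers (CPU cores minus 1).

```python
import os


def build_patternsim_command(unitex_dir, output_dir, corpus_files,
                             vocabulary=None, workers=None, language=None):
    """Assemble an argument list for ./patternsim from the documented flags.

    Hypothetical helper for illustration; it only builds the list and
    does not run anything.
    """
    if workers is None:
        # Rule of thumb from the option description: CPU cores minus 1,
        # but never fewer than one worker.
        workers = max(1, (os.cpu_count() or 2) - 1)
    cmd = ["./patternsim", "--unitex", unitex_dir, "--output", output_dir,
           "--workers", str(workers)]
    if vocabulary is not None:
        cmd += ["--vocabulary", vocabulary]
    if language is not None:
        cmd += ["--language", language]
    # Corpus files come last, after all options.
    return cmd + list(corpus_files)


cmd = build_patternsim_command("/home/user/Unitex3.0beta", "output",
                               ["corpus.txt"], vocabulary="vocabulary.txt",
                               workers=3)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains the patternsim script.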

This command writes a set of files to the directory "./output".

The files conc-freq.csv and corpus-freq.csv are CSV files in the following format:

word;frequency\n
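
A frequency file in this format can be loaded with a few lines of Python. This is a sketch under the assumption that frequencies are integers; lines that do not match the two-column layout (for example a header, should one be present) are skipped. The function name read_frequencies is hypothetical.

```python
import csv
import io


def read_frequencies(stream):
    """Read 'word;frequency' rows into a dict mapping word -> int count."""
    freqs = {}
    for row in csv.reader(stream, delimiter=";"):
        if len(row) != 2:
            continue  # blank or malformed line
        word, count = row
        try:
            freqs[word] = int(count)
        except ValueError:
            continue  # non-numeric second column, e.g. a header line
    return freqs


# In-memory sample standing in for conc-freq.csv or corpus-freq.csv.
sample = io.StringIO("cat;120\ndog;95\n")
freqs = read_frequencies(sample)
print(freqs)
```

For a real file, open it with open(path, encoding="utf-8", newline="") and pass the file object to the function.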

The files pairs.csv, pairs-np.csv and pairs-voc.csv are CSV files in the following format:

target-word;relatum-word;e-syno;e-cohypo;e-hyper-hypo;e-hyper;e-hypo;e-all;e1;e2;e3;e4;e5;e6;e7;e8;e9;e10;e11;e12;e13;e14;e15;e16;e17\n

Here target-word and relatum-word are words, e-all is the number of extractions between target-word and relatum-word with all 17 patterns, and ei is the number of extractions between target-word and relatum-word with the i-th pattern (see the paper referenced above for details). Thus e-all = sum_i (ei).

e-syno, e-cohypo, e-hyper-hypo, e-hyper and e-hypo are the numbers of specific relations extracted between the terms (synonyms, co-hyponyms, hypernyms+hyponyms, hypernyms and hyponyms, respectively).
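
The 25-column layout and the identity e-all = sum_i (ei) can be checked with a short script. This is an illustrative sketch, assuming all counts are integers; the names parse_pair_row and is_consistent are hypothetical, and the sample row is invented for the example.

```python
RELATION_COLUMNS = ["e-syno", "e-cohypo", "e-hyper-hypo", "e-hyper", "e-hypo"]
NUM_PATTERNS = 17
# 2 words + 5 relation counts + e-all + 17 per-pattern counts
EXPECTED_FIELDS = 2 + len(RELATION_COLUMNS) + 1 + NUM_PATTERNS


def parse_pair_row(line):
    """Split one pairs.csv line into words, relation counts and pattern counts."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    counts = [int(value) for value in fields[2:]]
    return {
        "target": fields[0],
        "relatum": fields[1],
        "relations": dict(zip(RELATION_COLUMNS, counts[:5])),
        "e_all": counts[5],
        "patterns": counts[6:],  # e1 .. e17
    }


def is_consistent(row):
    """Check that e-all equals the sum of the per-pattern counts."""
    return row["e_all"] == sum(row["patterns"])


# Invented sample row: e1 = 1, e3 = 2, all other patterns 0, so e-all = 3.
sample_line = ";".join(
    ["fruit", "apple", "0", "0", "3", "2", "1", "3", "1", "0", "2"] + ["0"] * 14
)
row = parse_pair_row(sample_line)
print(row["target"], row["relatum"], row["e_all"], is_consistent(row))
```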

Corpus

Here are some corpora which you may use with this tool:

Russian morphological dictionary

The Russian dictionary in this repository is an extract of the Russian computational morphological dictionary developed at CIS, Munich. This extract contains about 15% of the original dictionary (the most frequent lemmata). The whole dictionary actually contains 140,000 simple entries (= 2.7 million distinct forms), 166,000 simple proper nouns (= 900,000 distinct forms) and 1800 compound words.

If you want to use the full version of the lexicon, please contact:

Sebastian Nagel
CIS
Oettingenstr. 67
80538 München
Germany
wastl@cis.uni-muenchen.de
http://www.cis.uni-muenchen.de

For additional information see:

Nagel, Sebastian 2002: Formenbildung im Russischen. Formale Beschreibung und Automatisierung für das CISLEX-Wörterbuchsystem (http://www.cis.uni-muenchen.de/~wastl/pub/ruslex.pdf)

For a short description (in German), see http://www.cis.uni-muenchen.de/~wastl/pub/ruslexUnitex.pdf

rank

Reranks semantic similarity scores between words extracted with patternsim. The tool is located in the "rank" directory.

Synopsis

patternsim-rank [options]

System Requirements

Binaries

Binaries are readily available in the bin folder. On Unix-based systems you may use "./patternsim-rank" or "./patternsim-rank.exe". On Windows, use "patternsim-rank.exe".

Testing

  1. Download the test data from http://cental.fltr.ucl.ac.be/team/~panchenko/sim-eval/patternsim-rank-data.tgz.
  2. Save the archive to the "rank" directory.
  3. Extract the data (tar xzf patternsim-rank-data.tgz). The directory "data" should appear.
  4. Run the tests.sh script. It will write its output to the data/output folder.

Recompilation

  1. Open patternsim-rank.sln with MonoDevelop or Visual Studio.
  2. Build the solution.

Options

p, pairs

Required. A UTF-8 encoded CSV file produced by the PatternSim program, in the format:

target;relatum;syno;cohypo;hyper_hypo;hyper;hypo;sum;pattern1;pattern2;pattern3;pattern4;pattern5;pattern6;pattern7;pattern8;pattern9;pattern10;pattern11;pattern12;pattern13;pattern14;pattern15;pattern16;pattern17

This file must contain symmetric relations between words (PatternSim generates them by default). If there exists a relation 'target;relatum;type;sim', then there should exist one and only one relation 'relatum;target;type;sim' in the same file.
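
The symmetry requirement can be verified before running patternsim-rank. The sketch below is an assumption-laden illustration rather than part of the tool: it takes (target, relatum) tuples already read from the pairs file and reports every pair whose reverse does not occur exactly once. The function name find_asymmetric_pairs is hypothetical.

```python
from collections import Counter


def find_asymmetric_pairs(pairs):
    """Return (target, relatum) pairs whose reverse does not occur exactly once."""
    counts = Counter(pairs)
    problems = [
        (target, relatum)
        for (target, relatum) in counts
        if counts.get((relatum, target), 0) != 1
    ]
    return sorted(problems)


# Invented sample: ("a", "c") has no matching ("c", "a").
sample_pairs = [("a", "b"), ("b", "a"), ("a", "c")]
print(find_asymmetric_pairs(sample_pairs))
```

An empty result means every relation has exactly one counterpart in the opposite direction.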

o, output

Required. A UTF-8 encoded CSV file 'target;relatum;sim', where 'sim' is the similarity score between 'target' and 'relatum'. This file is sorted by 'target' and then by 'sim'.

c, corpusfreq

Required. A UTF-8 encoded CSV file 'word;freq' with frequencies of words.

t, type

Required. Type of reranking:

  1. Efreq, no reranking, transform scores to the interval [0;1].
  2. Efreq-Rfreq, reranking by frequency of relations to other words. Uses option 'alpha'.
  3. Efreq-Rnum, reranking by number of relations to other words. Uses option 'beta'.
  4. Efreq-Cfreq, reranking by word frequency. Uses option 'corpusfreq'.
  5. Efreq-Rnum-Cfreq, reranking by number of relations to other words and by word frequency. Uses options 'beta' and 'corpusfreq'.
  6. Efreq-Rnum-Cfreq-Pnum, reranking by number of relations to other words, by word frequency and by the number of different patterns that extracted the relation. Uses options 'corpusfreq', 'patterns', 'beta' and 'sqrt'.
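
As an illustration of the first type (Efreq), raw extraction counts can be mapped to the interval [0;1]. The exact transformation used by patternsim-rank is not stated here, so the sketch below assumes a simple division by the largest count; the function name normalize_scores is hypothetical and the result may differ from the tool's output.

```python
def normalize_scores(extractions):
    """Scale raw extraction counts into [0, 1] by dividing by the maximum.

    ASSUMPTION: max-normalization is used only for illustration; it is
    not confirmed to be the formula implemented by patternsim-rank.
    """
    if not extractions:
        return {}
    top = max(extractions.values())
    if top <= 0:
        return {pair: 0.0 for pair in extractions}
    return {pair: count / top for pair, count in extractions.items()}


# Invented counts keyed by (target, relatum).
raw = {("fruit", "apple"): 4, ("fruit", "pear"): 1}
print(normalize_scores(raw))
```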

a, alpha

Expected number of relations per word. Default: 15.

b, beta

Minimum number of extractions needed to establish a relation between words. Default: 2.

s, sqrt

Use the square root of the number of different patterns. Default: true.