HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

HeidelTime (Chinese) throws Exception / Error #31

Closed bisoldi closed 9 years ago

bisoldi commented 9 years ago

When attempting to run HeidelTime with Chinese, we get a FileNotFoundException. We are testing on the following Chinese text:

“西门商场(2013年9月31日武装分子在肯尼亚首都内罗毕袭击的购物中心)的时候我身在艺术咖啡厅,这次我又住在半岛酒店……凌晨4点前一切被炸成地狱的时候,我正准备离开我的房间。”

I'm running the following command:

java -jar de.unihd.dbs.heideltime.standalone.jar ~/chinese.txt -l chinese -t narrative -vv -pos treetagger

The config file points to the proper directory; however, we get the exception below:

java.io.FileNotFoundException: /home/bisoldi/bin/heideltime/treetagger/chinese-tokenizer/zh-tokenise/segment-zh.pl (No such file or directory)

    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerProperties.getChineseTokenizationProcess(TreeTaggerProperties.java:81)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.tokenizeChinese(TreeTaggerWrapper.java:302)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.process(TreeTaggerWrapper.java:222)
    at de.unihd.dbs.heideltime.standalone.components.impl.TreeTaggerWrapper.process(TreeTaggerWrapper.java:43)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishPartOfSpeechInformation(HeidelTimeStandalone.java:406)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.establishHeidelTimePreconditions(HeidelTimeStandalone.java:339)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:499)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.process(HeidelTimeStandalone.java:448)
    at de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone.main(HeidelTimeStandalone.java:765)

So, we renamed segment-zh.perl to segment-zh.pl and then re-executed HeidelTime and got the following:

[HeidelTime] HeidelTime has not found any sentence tokens in this document. HeidelTime needs sentence tokens tagged by a preprocessing UIMA analysis engine to do its work. Please check your UIMA workflow and add an analysis engine that creates these sentence tokens.

I then tried running TreeTagger standalone with Chinese. It looks for the correctly named segment-zh.perl, but it looks for it in TreeTagger's cmd directory. There are no instructions to put the Chinese tokenizer there, and because of the subdirectories created when the tokenizer's archive is extracted, it would not work anyway unless we moved the files manually.

So, I created symlinks to the actual locations and then tried running TreeTagger again with:

echo "西门商场(2013年9月31日武装分子在肯尼亚首都内罗毕袭击的购物中心)的时候我身在艺术咖啡厅,这次我又住在半岛酒店……凌晨4点前一切 被炸成地狱的时候,我正准备离开我的房间" | cmd/tree-tagger-chinese

And get the following:

reading parameters ...
Can't locate segmenter.pm in @INC (you may need to install the segmenter module) (@INC contains: ./cmd /etc/perl /usr/local/lib/perl/5.18.2 /usr/local/share/perl/5.18.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.18 /usr/share/perl/5.18 /usr/local/lib/site_perl .) at ./cmd/segment-zh.perl line 5.
BEGIN failed--compilation aborted at ./cmd/segment-zh.perl line 5.
tagging ... finished.
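
A note on the @INC error above: Perl searches the directories listed in @INC for segmenter.pm, and the list shown only covers ./cmd, the system Perl directories, and the current directory. One possible workaround, sketched below, is to add the directory that actually contains segmenter.pm to PERL5LIB before calling cmd/tree-tagger-chinese. TOKENIZER_DIR and prepend_perl5lib are hypothetical names, and the default path is only a guess based on the paths in this report; this has not been verified against the tokenizer package.

```shell
#!/bin/sh
# Assumption: TOKENIZER_DIR is the directory that holds segmenter.pm.
# The default below is a guess based on the paths in this report; adjust it.
TOKENIZER_DIR="${TOKENIZER_DIR:-$HOME/bin/heideltime/treetagger/chinese-tokenizer/zh-tokenise}"

# Hypothetical helper: print a PERL5LIB value with directory $1 placed
# in front of the existing value $2 (which may be empty).
prepend_perl5lib() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

PERL5LIB="$(prepend_perl5lib "$TOKENIZER_DIR" "${PERL5LIB:-}")"
export PERL5LIB

# Print the resulting module search path, if perl is available.
if command -v perl >/dev/null 2>&1; then
    perl -e 'print "$_\n" for @INC'
fi
```

With PERL5LIB exported, the cmd/tree-tagger-chinese command can be retried in the same shell. For a single invocation, perl -I "$TOKENIZER_DIR" has the same effect on @INC.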

That's where I stopped troubleshooting. Hopefully there is a simple answer to this, but let me know if I need to do anything further.

Thanks!

jzell commented 9 years ago

Hey,

did you try the instructions and tokenization package linked in our readme.txt?

The official Chinese tokenizer by Serge was changed a while ago, which kind of ruined our integration, so he has allowed us to host an older version. Since the Chinese preprocessing with the TreeTagger script is kinda messy anyway (it won't work at all under Windows and has some trouble under Linux too), you may also want to try the Stanford POS Tagger, which provides a more robust processing experience (although potentially worse results than the TreeTagger variant).

poethan commented 5 years ago

Hi, I am also running TreeTagger for Chinese, but I cannot find the file segment-zh.pl. Could you please let me know where to find it?

poethan commented 5 years ago

It said: _treetagger/cmd/segment-zh.perl": No such file or directory

jzell commented 5 years ago

As per the readme file linked in the previous post:

    * (OPTIONAL) For Chinese documents, please get the Tokenizer and TreeTagger parameter file
      from Serge Sharoff's page http://corpus.leeds.ac.uk/tools/zh/:
      - wget http://corpus.leeds.ac.uk/tools/zh/tt-lcmc.tgz
      - wget https://drive.google.com/uc?id=0B1ZoOwaeRsbva2F3NThLd3ptRWM -O zh-tokenise.tgz
      Extract the Tokenizer into a new directory and TreeTagger parameter files like this:
      - mkdir chinese-tokenizer
      - tar -xzvf tt-lcmc.tgz
      - tar -xzvf zh-tokenise.tgz -C chinese-tokenizer
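
After extracting, it may help to check that the files mentioned in the errors in this thread are actually present. The sketch below is not part of the readme: check_tokenizer_files is a hypothetical helper, and the default directory is an assumption based on the path in the FileNotFoundException above (chinese-tokenizer/zh-tokenise under the TreeTagger directory).

```shell
#!/bin/sh
# Assumption: run from the TreeTagger directory, with the tokenizer
# extracted to chinese-tokenizer/zh-tokenise. Adjust to your layout.
TOKENIZER_DIR="${TOKENIZER_DIR:-chinese-tokenizer/zh-tokenise}"

# Hypothetical helper: report which of the file names mentioned in this
# thread exist in directory $1.
check_tokenizer_files() {
    for name in segment-zh.pl segment-zh.perl segmenter.pm; do
        if [ -f "$1/$name" ]; then
            echo "found:   $1/$name"
        else
            echo "missing: $1/$name"
        fi
    done
}

check_tokenizer_files "$TOKENIZER_DIR"
```

A "missing" line for segment-zh.pl would match the FileNotFoundException reported at the top of this thread.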

Ravan-laws commented 4 years ago

Dear author, I am running the process through Xshell, connected to a remote Linux server. Here is my full command:

java -jar de.unihd.dbs.heideltime.standalone.jar test.txt -l CHINESE -t NEWS -vv

When it reached the tokenizeChinese function, it raised a "Broken pipe" or sometimes a "Stream closed" exception, like this:

java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
    at java.io.BufferedWriter.flush(BufferedWriter.java:254)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.tokenizeChinese(TreeTaggerWrapper.java:324)

Could you give me some advice, please? Thank you.
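
A "Broken pipe" on write generally means that the process at the other end of the pipe has already exited, so one way to narrow this down is to run the tokenizer script by hand and look at its exit status and error output. The sketch below is a diagnostic under stated assumptions, not HeidelTime's own code: probe_filter is a hypothetical helper, SEGMENTER is an assumed script location, and whether the script needs extra arguments is not known from this thread.

```shell
#!/bin/sh
# Assumption: location of the tokenizer script; adjust to your installation.
SEGMENTER="${SEGMENTER:-chinese-tokenizer/zh-tokenise/segment-zh.pl}"

# Hypothetical helper: feed one short line (taken from the sample text in
# this thread) to the given command on stdin, discard its stdout, and
# print its exit status. Error output stays visible on the terminal.
probe_filter() {
    if printf '%s\n' '2013年9月31日' | "$@" >/dev/null; then
        echo "exit status: 0"
    else
        echo "exit status: $?"
    fi
}

# Only probe the script if it is actually there.
if [ -f "$SEGMENTER" ]; then
    probe_filter perl "$SEGMENTER"
else
    echo "not found: $SEGMENTER"
fi
```

A non-zero status together with a Perl error, such as the "Can't locate segmenter.pm" message earlier in this thread, would suggest that the tokenizer itself fails to start rather than a problem on the Java side.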