Closed bisoldi closed 9 years ago
Hey,
did you try the instructions and tokenization package linked in our readme.txt?
The official Chinese tokenizer by Serge was changed a while ago, which broke our integration, so he has allowed us to host an older version. Since Chinese preprocessing with the TreeTagger script is messy anyway (it won't work at all under Windows and has some trouble under Linux too), you may also want to try the Stanford POS Tagger, which offers more robust processing (although potentially worse results than the TreeTagger variant).
Hi, I am also running TreeTagger for Chinese, but I cannot find the file segment-zh.pl. Can you please let me know where to find it?
It said: _treetagger/cmd/segment-zh.perl": No such file or directory
As per the readme file linked in the previous post:
* (OPTIONAL) For Chinese documents, please get the Tokenizer and TreeTagger parameter file
from Serge Sharoff's page http://corpus.leeds.ac.uk/tools/zh/:
- wget http://corpus.leeds.ac.uk/tools/zh/tt-lcmc.tgz
- wget https://drive.google.com/uc?id=0B1ZoOwaeRsbva2F3NThLd3ptRWM -O zh-tokenise.tgz
Extract the Tokenizer into a new directory and TreeTagger parameter files like this:
- mkdir chinese-tokenizer
- tar -xzvf tt-lcmc.tgz
- tar -xzvf zh-tokenise.tgz -C chinese-tokenizer
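After extracting, it may help to check where the segmenter script actually ended up, since the archive creates its own subdirectories. The following is a minimal sketch only: `find_segmenter` is a hypothetical helper name, and the `segment-zh*` pattern is taken from the file names mentioned in this thread rather than from the archive's documented layout.

```shell
# Hypothetical helper: list any Chinese segmenter scripts below a directory.
# The pattern "segment-zh*" is an assumption based on the file names
# (segment-zh.pl / segment-zh.perl) mentioned in this thread.
find_segmenter() {
    dir="${1:-chinese-tokenizer}"
    if [ ! -d "$dir" ]; then
        echo "directory not found: $dir" >&2
        return 1
    fi
    # Print the path of every regular file whose name starts with segment-zh.
    find "$dir" -type f -name 'segment-zh*'
}

# Usage (after the tar commands above):
#   find_segmenter chinese-tokenizer
```

If nothing is printed, the archive was probably extracted somewhere else or the download did not complete.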
Dear author, I run the process through Xshell, which connects to a remote Linux server. Here is my whole command:
java -jar de.unihd.dbs.heideltime.standalone.jar test.txt -l CHINESE -t NEWS -vv
When it ran the tokenizeChinese function, it raised a "Broken pipe" or sometimes a "Stream closed" exception, like this:
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
    at java.io.BufferedWriter.flush(BufferedWriter.java:254)
    at de.unihd.dbs.uima.annotator.treetagger.TreeTaggerWrapper.tokenizeChinese(TreeTaggerWrapper.java:324)
Could you give me some advice, please? Thank you.
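A "Broken pipe" on write generally means that the process reading the other end of the pipe has already exited. Assuming the tokenizer is launched as a Perl child process (the file names in this thread suggest so, but this is an assumption, not something confirmed here), one way to narrow it down is to check the script outside of Java first. The sketch below uses a hypothetical helper name, `check_segmenter`, and only verifies that `perl` is available and that the script compiles.

```shell
# Hypothetical helper: sanity-check a segmenter script before running it from Java.
# Assumes the tokenizer is a Perl script; adjust if it is not.
check_segmenter() {
    script="$1"
    # perl must be on PATH for the child process to start at all.
    command -v perl >/dev/null 2>&1 || { echo "perl not found on PATH" >&2; return 1; }
    # The script itself must exist and be readable.
    [ -r "$script" ] || { echo "cannot read: $script" >&2; return 1; }
    # perl -c compiles the script without running it, so missing modules
    # or syntax errors show up here.
    perl -c "$script"
}

# Usage:
#   check_segmenter /path/to/segment-zh.pl
```

If this check fails, the child process most likely dies at startup, which would be consistent with the exception above.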
When attempting to run HeidelTime with Chinese, we get a FileNotFoundException. We are testing on the following Chinese text:
I'm running the following command:
The config file points to the proper directory, however we get the below exception:
So, we renamed segment-zh.perl to segment-zh.pl and then re-executed HeidelTime and got the following:
I then tried running TreeTagger with Chinese standalone and found that, while it looks for the correctly named segment-zh.perl, it looks for it in TreeTagger's cmd directory. However, there are no instructions to put the Chinese tokenizer there, and with the subdirectories created by the tokenizer's compressed file it would not work anyway unless we manually moved it.
So, I created symlinks to the actual locations and then tried running TreeTagger again with:
And got the following:
That's where I completed my troubleshooting. Hopefully there is a simple answer to this, but let me know if I need to do anything further.
Thanks!
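For anyone retracing the symlink workaround described above, it can be sketched roughly as follows. `link_segmenter` is a hypothetical helper name, and both paths are placeholders to be replaced with the real locations on your system. The target name segment-zh.perl is the one TreeTagger was reported to look for in this thread.

```shell
# Hypothetical helper: link a segmenter script into TreeTagger's cmd directory
# under the name TreeTagger looks for (segment-zh.perl, per this thread).
link_segmenter() {
    src="$1"        # actual location of the extracted segmenter script
    cmd_dir="$2"    # TreeTagger's cmd directory
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    [ -d "$cmd_dir" ] || { echo "no such directory: $cmd_dir" >&2; return 1; }
    # Resolve to an absolute path so the link does not depend on the working directory.
    src_abs="$(cd "$(dirname "$src")" && pwd)/$(basename "$src")"
    # Plain ln -s refuses to overwrite an existing file, which is intentional here.
    ln -s "$src_abs" "$cmd_dir/segment-zh.perl"
}

# Usage (placeholder paths):
#   link_segmenter /path/to/extracted/segmenter-script /path/to/treetagger/cmd
```

Note that a link to the script alone may not be enough if the script expects other files from the tokenizer's subdirectories to sit alongside it.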