Closed alexfridlyand closed 6 years ago
@alexfridlyand
Hi,
I tried to run the baseline model described at https://github.com/tensorflow/models/tree/master/syntaxnet/g3doc/conll2017,
but inference fails with UTF-8 warnings followed by a std::out_of_range exception.
...
2017-04-01 09:57:58.442684: I syntaxnet/embedding_feature_extractor.cc:35] Features: input.focus;input.focus stack.focus stack(1).focus;stack.focus stack(1).focus
2017-04-01 09:57:58.442689: I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: lookahead;tagger;rnn-stack
2017-04-01 09:57:58.442692: I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;64;64
2017-04-01 09:57:58.442810: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
2017-04-01 09:57:58.442830: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string
INFO:tensorflow:Read 0 documents
...
Since I haven't found a way to fix it, I worked around it by dropping the 'char2word' layer when building the 'master_spec'.
After that, everything works fine.
https://github.com/dsindex/syntaxnet#dragnn
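The `UTF-8 buffer is not interchange-valid` warnings suggest that some input bytes may not be clean UTF-8, although the thread never confirms the root cause. A minimal sketch, assuming the corpus is a plain text or CoNLL-U file, for listing the lines that fail strict UTF-8 decoding. The function name `find_invalid_utf8_lines` is hypothetical, and the parser's notion of "interchange-valid" may be stricter than plain UTF-8 validity, so a clean result here does not rule the input out.

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, error_message) pairs for lines that are not valid UTF-8."""
    bad = []
    with open(path, "rb") as handle:  # read raw bytes so nothing is decoded implicitly
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on malformed sequences
            except UnicodeDecodeError as error:
                bad.append((number, str(error)))
    return bad
```

Calling it on a training file such as the `ru-ud-train.conllu.conv` used below would return an empty list when every line decodes cleanly.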
If you are interested in training and testing on the Russian corpus,
download the Russian UD corpus from http://universaldependencies.org
compile
$ pwd
/path/to/models/syntaxnet
$ bazel build -c opt //work/dragnn_examples:write_master_spec
$ bazel build -c opt //work/dragnn_examples:train_dragnn
$ bazel build -c opt //work/dragnn_examples:inference_dragnn
train
$ pwd
/path/to/work/UD_Russian
SRC_CORPUS_DIR=${CDIR}/UD_Russian
TRAIN_FILE=${DATA_DIR}/ru-ud-train.conllu.conv
DEV_FILE=${DATA_DIR}/ru-ud-dev.conllu.conv
$ nohup ./train_dragnn.sh -v -v &
test
$ cat textfile | ./test_dragnn.sh -v -v
Note again that
loading a downloaded model for annotation is not yet supported here.
I think the original code at https://github.com/tensorflow/models/tree/master/syntaxnet/dragnn/tools may work (I didn't test it).
Thank you very much for such detailed response! I will reply shortly in case of issues, great stuff.
I got this error at the inference stage (with a model trained on the Russian dataset): UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128).
Adding this to test_dragnn.sh made it work:
reload(sys)
sys.setdefaultencoding('utf8')
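`sys.setdefaultencoding` exists only in Python 2 and changes the default for the whole process. As an alternative sketch (not what the scripts in this thread do), bytes can be decoded explicitly as UTF-8 at the point where they are read. Byte 0xd0 is the lead byte of many Cyrillic letters in UTF-8, which is why the ASCII codec rejects Russian text. The example is written for Python 3, and the helper name `to_text` is hypothetical.

```python
def to_text(data, encoding="utf-8"):
    """Decode bytes to text explicitly instead of relying on the process default."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data  # already text


raw = "Привет".encode("utf-8")  # the first byte of this sequence is 0xd0

try:
    raw.decode("ascii")  # same failure mode as the reported error
except UnicodeDecodeError as error:
    print("ascii failed:", error.reason)

print(to_text(raw))
```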
@dsindex Could you also help with converting the output to the Brat standoff (.ann) format so it can be displayed in Brat? I commented out ${CONLL2TREE} --alsologtostderr in test_dragnn for this, but I still need to convert the CoNLL-U output to standoff. I'm trying this repo: https://github.com/spyysalo/conllu.py
but I'm getting multiple parse issues. Could you advise?
@alexfridlyand
That is a cool repo!
I'm not sure about the multiple parse issues you mentioned,
but conllu.py looks like it does file-based processing in two passes:
one for the text, the other for the annotations. It's tricky... ;;
I think we'd better save the CoNLL-U output (from test_dragnn.sh) to a file and run conllu.py on it.
$ cat file.txt | ./test_dragnn.sh > file.conllu
$ python conllu.py/convert.py -o outdir file.conllu
If we want to run this on-line, we have to modify conllu.py/convert.py and conllu.py/conllu/conllu.py, which seems time-consuming.
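For the on-line case, one building block that does not require touching conllu.py is to group the parser's CoNLL-U output into sentences as it arrives, since CoNLL-U separates sentences with blank lines. A minimal sketch: `iter_sentences` is a hypothetical helper, not part of conllu.py, the sample lines are made up, and the conversion to standoff would still have to be written on top of it.

```python
def iter_sentences(lines):
    """Yield one sentence at a time as a list of CoNLL-U lines (blank-line separated)."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:  # last sentence when the stream does not end with a blank line
        yield block


# Made-up sample input: two one-token sentences separated by a blank line.
sample = [
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n",
    "\n",
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n",
]
sentences = list(iter_sentences(sample))
```

Passing `sys.stdin` instead of `sample` would consume the output of test_dragnn.sh through a pipe, one sentence at a time.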
By the way, I have a question about the brat tool. As described in the issue I reported, https://github.com/nlplab/brat/issues/1221, I can't annotate relations because no dialog action is available.
Do you know how to fix it?
I only use brat as a comparison tool. If I figure it out, I'll let you know.
@dsindex Running the same commands you wrote, I'm getting
conllu.conllu.FormatError: invalid CPOSTAG: PRP$ (line 4)
on a file with Russian sentences.
@alexfridlyand thank you :)
Hmm... with the UD_English and Korean corpora there is no error.
I guess the CPOSTAG value is not in the expected format:
CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')
...
# some character set constraints
if not CPOSTAG_RE.match(self.cpostag):
    raise FormatError('invalid CPOSTAG: %s' % self.cpostag)
Here, self.cpostag is set by the from_string method:
def from_string(cls, s):
    fields = s.split('\t')
    if len(fields) != 10:
        raise FormatError('got %d/10 field(s)' % len(fields), s)
    fields[5] = [] if fields[5] == '_' else fields[5].split('|')  # feats
    fields[8] = [] if fields[8] == '_' else fields[8].split('|')  # deps
    return cls(*fields)
Since I don't know exactly why that character is there,
filtering the fields list
is the approach I would take ;;
Hope it helps.
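The filtering idea can be sketched as a pre-processing step on the .conllu file, so conllu.py itself stays unmodified. This is only an assumption about one workable approach: `sanitize_cpostag` and the `X` fallback are hypothetical, the sample line is made up, and only the fourth field (the one checked by CPOSTAG_RE) is touched.

```python
import re

CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')  # same pattern as quoted above


def sanitize_cpostag(line, fallback="X"):
    """Return the CoNLL-U line with non-letter characters removed from field 4."""
    if not line.strip() or line.startswith("#"):
        return line  # blank separators and comments pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        return line  # leave malformed lines for the parser to report
    tag = fields[3]
    if tag != "_" and not CPOSTAG_RE.match(tag):
        cleaned = re.sub(r"[^a-zA-Z]", "", tag)
        fields[3] = cleaned or fallback  # assumption: use 'X' if nothing is left
    ending = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + ending


# Made-up sample line with the tag from the reported error.
line = "4\tего\tон\tPRP$\tPRP$\t_\t5\tnmod\t_\t_\n"
print(CPOSTAG_RE.match("PRP$"))               # None: '$' is outside [a-zA-Z]
print(sanitize_cpostag(line).split("\t")[3])  # PRP
```

Stripping the `$` merges PRP$ into PRP, so this only gets the file past validation; it does not repair the tagging itself.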
Thank you. I think I'll just use the second Russian treebank, which is much bigger and appears to have proper tags.
Hello! Amazing work! Could you please tell me how to use your scripts to pass a text file to the test dragnn script (Russian baseline model) and get the output in CoNLL format?
If possible, please provide detailed instructions: where to copy the files and so on. Thanks in advance!