dsindex / syntaxnet

reference code for syntaxnet

How to run conll17 dragnn baseline model? #21

Closed alexfridlyand closed 6 years ago

alexfridlyand commented 7 years ago

Hello! Amazing work! Could you please tell me how to use your scripts to pass a text file to the DRAGNN test script (Russian baseline model) and get the output in CoNLL format?

If possible, please provide detailed instructions: where to copy files, and so on. Thanks in advance!

dsindex commented 7 years ago

@alexfridlyand

hi~

i tried to run the baseline model described at https://github.com/tensorflow/models/tree/master/syntaxnet/g3doc/conll2017

but there is a problem related to 'utf8' and 'std::out_of_range' in the inference step.

...
2017-04-01 09:57:58.442684: I syntaxnet/embedding_feature_extractor.cc:35] Features: input.focus;input.focus stack.focus stack(1).focus;stack.focus stack(1).focus
2017-04-01 09:57:58.442689: I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: lookahead;tagger;rnn-stack
2017-04-01 09:57:58.442692: I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;64;64
2017-04-01 09:57:58.442810: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
2017-04-01 09:57:58.442830: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string
INFO:tensorflow:Read 0 documents
...

since i haven't found a way to fix it, i decided to work around it by dropping the 'char2word' layer when building 'master_spec'.

after that, everything works fine.

https://github.com/dsindex/syntaxnet#dragnn

if you are interested in training and testing on the Russian corpus,

  1. download Russian UD corpus from http://universaldependencies.org

  2. compile

    $ pwd
    /path/to/models/syntaxnet
    $ bazel build -c opt //work/dragnn_examples:write_master_spec
    $ bazel build -c opt //work/dragnn_examples:train_dragnn
    $ bazel build -c opt //work/dragnn_examples:inference_dragnn
  3. train

    • say, UD_Russian directory in the path
      $ pwd
      /path/to/work/UD_Russian
    • edit train_dragnn.sh
      SRC_CORPUS_DIR=${CDIR}/UD_Russian
      TRAIN_FILE=${DATA_DIR}/ru-ud-train.conllu.conv
      DEV_FILE=${DATA_DIR}/ru-ud-dev.conllu.conv
    • run
      $ nohup ./train_dragnn.sh -v -v &
  4. test

    • run
      $ cat textfile | ./test_dragnn.sh -v -v

note again that loading the downloaded model for annotation is not yet supported here.

but i think the original code at https://github.com/tensorflow/models/tree/master/syntaxnet/dragnn/tools may work well (i haven't tested it).

alexfridlyand commented 7 years ago

Thank you very much for such detailed response! I will reply shortly in case of issues, great stuff.

alexfridlyand commented 7 years ago

Got this error at the inference stage (with a model trained on the Russian dataset): UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128).

alexfridlyand commented 7 years ago

Added this to test_dragnn.sh and it works:

import sys  # Python 2 only: reload() restores setdefaultencoding, which site.py removes at startup
reload(sys)
sys.setdefaultencoding('utf8')
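
For context, the error above is what an ASCII decode of UTF-8 Cyrillic bytes produces, since the first byte of most Cyrillic letters is 0xd0 or 0xd1. The following is a minimal sketch, not the code from test_dragnn.sh, of decoding explicitly instead of changing the process-wide default. It uses Python 3 syntax, and `decode_line` is a hypothetical helper introduced only for illustration.

```python
# Sketch only: decode_line is a hypothetical helper, not part of test_dragnn.sh.

def decode_line(raw, encoding="utf-8"):
    """Decode one raw input line (bytes) into text using an explicit codec."""
    return raw.decode(encoding)


# UTF-8 bytes of a Russian word; the first byte is 0xd0.
raw = "Привет".encode("utf-8")

# Decoding as ASCII reproduces the reported failure.
try:
    decode_line(raw, "ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = err

# Decoding with the codec named explicitly succeeds.
text = decode_line(raw)
```

Naming the codec at the point where bytes become text keeps the fix local, whereas `sys.setdefaultencoding` changes behaviour for every implicit conversion in the process.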

alexfridlyand commented 7 years ago

@dsindex Maybe you could also help with converting the output to the Brat standoff (.ann) format so it can be displayed in Brat? I commented out ${CONLL2TREE} --alsologtostderr in test_dragnn for this, but then I need to convert the CoNLL-U format to standoff. I'm trying with this repo: https://github.com/spyysalo/conllu.py

but I'm getting multiple parse issues. Could you advise something?

dsindex commented 7 years ago

@alexfridlyand

that is a cool repo!

i am not sure about the parse issues you mentioned, but conllu.py looks like it does file-based processing in two passes: one for the text, the other for the annotations. it is tricky..... ;; i think we'd better save the conllu files (from test_dragnn.sh) and then run conllu.py on them.

$ cat file.txt | ./test_dragnn.sh > file.conllu
$ python conllu.py/convert.py -o outdir file.conllu

if we want to run it in an on-line manner, we have to modify conllu.py/convert.py and conllu.py/conllu/conllu.py. it seems time-consuming.

by the way, i have a question about the brat tool. as described in the issue i reported, https://github.com/nlplab/brat/issues/1221, i can't annotate relations because there is no dialog action.

do you know how to fix it?

alexfridlyand commented 7 years ago

I use brat only as a comparison tool. If I figure it out, I'll let you know.

@dsindex Using the same commands you wrote, I'm getting conllu.conllu.FormatError: invalid CPOSTAG: PRP$ (line 4) on a file with Russian sentences.
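
To see how many lines are affected before converting, the saved file can be scanned for tags that would be rejected. This is a minimal sketch under stated assumptions: lines are tab-separated with 10 columns, CPOSTAG is the fourth column, and a valid tag consists of ASCII letters only. `find_bad_cpostags` is a hypothetical helper, not part of conllu.py.

```python
import re

# Assumption: a valid CPOSTAG consists of ASCII letters only.
LETTERS_ONLY = re.compile(r'^[a-zA-Z]+$')
CPOSTAG_COL = 3  # 0-based index of the fourth column


def find_bad_cpostags(lines):
    """Return (line_number, tag) pairs for token lines with a non-letter CPOSTAG."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        # Skip sentence separators and comment lines.
        if not line or line.startswith('#'):
            continue
        fields = line.split('\t')
        # Lines with the wrong field count are a different problem; skip them here.
        if len(fields) != 10:
            continue
        if not LETTERS_ONLY.match(fields[CPOSTAG_COL]):
            bad.append((line_no, fields[CPOSTAG_COL]))
    return bad


# Small made-up sample; the fourth line carries the offending tag.
sample = (
    "# sent_id = 1\n"
    "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tsee\tsee\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\this\the\tPRP$\tPRP$\t_\t4\tnmod\t_\t_\n"
    "\n"
)
bad_tags = find_bad_cpostags(sample.splitlines())
```

For a real file, the same function can be called on an open file object, for example `find_bad_cpostags(open('file.conllu', encoding='utf-8'))`.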

dsindex commented 7 years ago

@alexfridlyand thank you :)

hmm.... with the UD_English and Korean corpora there is no error. i guess the cpostag is not in the right format: the check below only accepts ASCII letters, and PRP$ contains a '$'.

CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')
...
        # some character set constraints
        if not CPOSTAG_RE.match(self.cpostag):
            raise FormatError('invalid CPOSTAG: %s' % self.cpostag)

here, self.cpostag is set by the from_string method

def from_string(cls, s):
    fields = s.split('\t')
    if len(fields) != 10:
        raise FormatError('got %d/10 field(s)' % len(fields), s)
    fields[5] = [] if fields[5] == '_' else fields[5].split('|')  # feats
    fields[8] = [] if fields[8] == '_' else fields[8].split('|')  # deps
    return cls(*fields)

since i don't know exactly why such a character is in there, filtering the fields list is the approach i'd take ;;

hope it helps.
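
The filtering idea could look roughly like the sketch below, applied to each line of the saved file before it reaches the converter. It is an illustration under stated assumptions, not a change to conllu.py: `sanitize_line` is a hypothetical helper, stripping non-letter characters (so PRP$ becomes PRP) is only one possible policy, and falling back to X (the UD tag for "other") when nothing is left is likewise an assumption.

```python
import re

CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')  # same pattern as quoted above
CPOSTAG_COL = 3  # 0-based index of CPOSTAG in a 10-column line


def sanitize_line(line):
    """Return the line with non-letter characters stripped from CPOSTAG."""
    # Leave blank lines and comments untouched.
    if not line.strip() or line.startswith('#'):
        return line
    fields = line.rstrip('\n').split('\t')
    # Leave malformed lines alone so the converter can still report them.
    if len(fields) != 10:
        return line
    if not CPOSTAG_RE.match(fields[CPOSTAG_COL]):
        cleaned = re.sub(r'[^a-zA-Z]', '', fields[CPOSTAG_COL])
        # Assumption: fall back to 'X' if no letters remain.
        fields[CPOSTAG_COL] = cleaned or 'X'
    newline = '\n' if line.endswith('\n') else ''
    return '\t'.join(fields) + newline


fixed = sanitize_line("4\this\the\tPRP$\tPRP$\t_\t5\tnmod\t_\t_\n")
```

Only the CPOSTAG column is rewritten. The fifth column keeps its original value, so the language-specific tag is not lost.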

alexfridlyand commented 7 years ago

Thank you, I think I'll just use the second Russian treebank, which is much bigger and appears to have proper tags.