Prevent empty sentences in tokenization

futurulus commented 5 years ago

Overview

In some cases (usually involving sequences of multiple whitespace characters), the tokenizer can produce sentences with zero tokens. This causes errors later in the pipeline, specifically the following:

File "/usr/local/lib/python3.6/dist-packages/cube/api.py" line 194 in __call__
    sequences = self._parser.parse_sequences(sequences)
File "/usr/local/lib/python3.6/dist-packages/cube/generic_networks/parsers.py" line 496 in parse_sequences
    predicted_tags = self.tag(new_sequence)
File "/usr/local/lib/python3.6/dist-packages/cube/generic_networks/parsers.py" line 226 in tag
    arc_matrix, aux_arc_matrix, proj_labels, softmax_morphology = self._predict_arc(seq)
File "/usr/local/lib/python3.6/dist-packages/cube/generic_networks/parsers.py" line 470 in _predict_arc
    s_max = dy.softmax(dy.concatenate(s_max))
File "_dynet.pyx" line 4605 in _dynet.concatenate
File "_dynet.pyx" line 4618 in _dynet.concatenate
AssertionError: List is empty, nothing to concatenate.

This change removes empty sequences from the tokenization output.

Testing Instructions

Check out master

Run the following minimal example script:

from cube.api import Cube
nlp = Cube()
nlp.load('en')
print(nlp('a.     There'))

Expected output: crash traceback above
Check out this branch and re-run the script
Expected output: a two-sentence CONLL-U parse

tiberiu44 commented 5 years ago

Hi @futurulus ,

First of all thank you for you contribution. It's great to see people helping us grow NLP-Cube. This is a nice catch with the empty sentences. I see that the tests are failing because of a dependency that does not install correctly on CircleCI (nothing to do with your changes). I'm going to approve this PR without the checks and we are going to include it in a new release of the pip package (after we run some local tests). However, you will have to sign the Adobe CLA, before we can merge the change: http://opensource.adobe.com/cla.html

Let me know when you've done this, so I can re-run the CLA test.

Thanks again, Tibi

futurulus commented 5 years ago

Great! Running the CLA by my employer. (Although given how simple the change is, maybe I should just reopen this as an Issue instead of a PR and link to a few possible places where you can make the "obvious" fix yourselves 😉)

tiberiu44 commented 5 years ago

@futurulus - your PR is now integrated in nlpcube-1.0.8.

adobe / NLP-Cube

Prevent empty sentences in tokenization #114

Overview

Testing Instructions