futurulus closed this 5 years ago
Hi @futurulus ,
First of all, thank you for your contribution. It's great to see people helping us grow NLP-Cube. This is a nice catch with the empty sentences. I see that the tests are failing because of a dependency that does not install correctly on CircleCI (nothing to do with your changes). I'm going to approve this PR without the checks, and we are going to include it in a new release of the pip package (after we run some local tests). However, you will have to sign the Adobe CLA before we can merge the change: http://opensource.adobe.com/cla.html
Let me know when you've done this, so I can re-run the CLA test.
Thanks again, Tibi
Great! Running the CLA by my employer. (Although given how simple the change is, maybe I should just reopen this as an Issue instead of a PR and link to a few possible places where you can make the "obvious" fix yourselves 😉)
@futurulus - your PR is now integrated in nlpcube-1.0.8.
Overview
In some cases (usually involving sequences of multiple whitespace characters), the tokenizer can produce sentences with zero tokens. This causes errors later in the pipeline, specifically the following:
This change removes empty sequences from the tokenization output.
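For illustration only, the filtering step can be sketched as below. This is not the actual NLP-Cube code: the function name `drop_empty_sentences` and the list-of-token-lists representation are assumptions made for the example, and the real tokenizer output may use different types.

```python
from typing import List, Sequence, TypeVar

Token = TypeVar("Token")


def drop_empty_sentences(sentences: Sequence[Sequence[Token]]) -> List[List[Token]]:
    """Return only the sentences that contain at least one token.

    Hypothetical helper: the name and the data layout are assumptions,
    not part of the NLP-Cube API.
    """
    return [list(sentence) for sentence in sentences if len(sentence) > 0]


# Made-up tokenizer output in which runs of whitespace produced
# two zero-token sentences.
raw_output = [["Hello", "world", "."], [], ["Second", "sentence", "."], []]
cleaned = drop_empty_sentences(raw_output)
print(cleaned)
```

Filtering at the tokenizer boundary means later pipeline stages never see a zero-token sentence, so they need no special handling for that case.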
Testing Instructions