bminixhofer / nlprule

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.
Apache License 2.0
594 stars 39 forks source link

oob access since 0.5.3 #61

Closed drahnr closed 3 years ago

drahnr commented 3 years ago

Since attempting to upgrade to 0.5.3 it consistently segfaults in https://github.com/bminixhofer/nlprule/blob/main/nlprule/src/rule/engine/composition.rs#L345-L347

See https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/pr-validate/builds/26

But the bug is in line 344 - where you should push the length in chars, not in bytes.

drahnr commented 3 years ago

Note that this is a general indicator of test coverage /w regards to emojis in need of expansion :upside_down_face:

bminixhofer commented 3 years ago

Hi, thanks for finding this. This really shouldn't have happened, but it's not an issue related to emojis. There is always exactly one sentence in the Rule unit tests, so the logic wasn't tested for correctness across sentences (which was not important before 0.5 since everything was relative to the sentence, not the input text). I'm fixing that now by shifting the input one char to the right in the tests to make sure nothing assumes the sentence to start from zero.

Don't worry about #62, I'll fix it and release today, there might be some other issues related to that too.

bminixhofer commented 3 years ago

Also, how hard would it be to extract plaintext from your test corpora in cargo-spellcheck? At the moment there are no tests checking just a large amount of random plaintext, I think that would be a good idea to add. There are quickcheck tests with random strings but that's not the same (since most of the time no rules will match for complete junk, otherwise these tests would've caught the bug too).

drahnr commented 3 years ago

Also, how hard would it be to extract plaintext from your test corpora in cargo-spellcheck? At the moment there are no tests checking just a large amount of random plaintext, I think that would be a good idea to add. There are quickcheck tests with random strings but that's not the same (since most of the time no rules will match for complete junk, otherwise these tests would've caught the bug too).

Hard, I run cargo-spellcheck on the rustc and juice codebase for validation.

drahnr commented 3 years ago

Hi, thanks for finding this. This really shouldn't have happened, but it's not an issue related to emojis. There is always exactly one sentence in the Rule unit tests, so the logic wasn't tested for correctness across sentences (which was not important before 0.5 since everything was relative to the sentence, not the input text). I'm fixing that now by shifting the input one char to the right in the tests to make sure nothing assumes the sentence to start from zero.

Maybe adding a test /w multiple sentences and multi-width unicode chars?

Don't worry about #62, I'll fix it and release today, there might be some other issues related to that too.

62 also avoids the collect, which avoids another intermediate allocation and does not break any test cases.

drahnr commented 3 years ago

And you are correct, line 344 was correct.