Heads up that we observed some tests breaking on our (old, 0.5.1) version of text2vec after updating to ICU72.
I haven't had time to dig into it extensively (in particular, whether the HEAD version of text2vec is still affected), but the root cause is likely similar to what we reported to tokenizers: https://github.com/ropensci/tokenizers/issues/82
I am reporting despite not having a pure reproducible example because I saw this test still hard-coding the exact number of tokenized words (even though it's now marked dont-test): https://github.com/dselivanov/text2vec/blob/311c4839e07be30ba273044f542b9281452083ca/tests/testthat/dont-test-vocab-corpus.R#L16
So I suspect the dev version is still susceptible. Note that because stringi bundles versioned copies of ICU (e.g. icu69), this is unlikely to affect the CRAN version of text2vec until stringi updates its bundled copy, but it could still affect users like us who link against a newer system ICU than the one bundled with the CRAN sources.
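To see whether a given stringi build is exposed, you can check which ICU it is actually linked against. A minimal sketch (the exact fields returned by `stri_info()` may vary slightly across stringi versions):

```r
library(stringi)

# stri_info() reports build details, including the ICU version stringi
# was compiled/linked against. A bundled copy will report e.g. "69.x",
# while a build linked to system ICU 72 will report "72.x".
info <- stri_info()
print(info$ICU.version)
```

If this reports 72 or later, the word-boundary behavior changes discussed here apply.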
Here's the output of testthat on 0.5.1 for reference:
```
══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-hash-corpus.R:19'): Unigram Hash Corpus construction ─────────
length(m@x) (`actual`) not equal to 140615L (`expected`).
`actual`: 140614
`expected`: 140615
── Failure ('test-hash-corpus.R:37'): trigram hash-corpus construction ─────────
length(m@x) (`actual`) not equal to 591549L (`expected`).
`actual`: 591559
`expected`: 591549
── Failure ('test-iterators.R:25'): ifiles ─────────────────────────────────────
nrow(v) (`actual`) not equal to 7448 (`expected`).
`actual`: 7446
`expected`: 7448
── Failure ('test-iterators.R:38'): ifiles_parallel ────────────────────────────
nrow(v) (`actual`) not equal to 7448 (`expected`).
`actual`: 7446
`expected`: 7448
── Failure ('test-vocab-corpus.R:16'): Vocabulary pruning ──────────────────────
length(vocab$term) (`actual`) not equal to 19297 (`expected`).
`actual`: 19289
`expected`: 19297
── Failure ('test-vocab-corpus.R:19'): Vocabulary pruning ──────────────────────
max(vocab$term_count) (`actual`) not equal to 13224 (`expected`).
`actual`: 13227
`expected`: 13224
── Failure ('test-vocab-corpus.R:86'): Unigran Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 141714L (`expected`).
`actual`: 141713
`expected`: 141714
── Failure ('test-vocab-corpus.R:104'): bi-gram Vocabulary Corpus construction ──
sum(grepl("_", vocab$term, fixed = T)) (`actual`) not equal to 121333L (`expected`).
`actual`: 121331
`expected`: 121333
── Failure ('test-vocab-corpus.R:105'): bi-gram Vocabulary Corpus construction ──
length(vocab$term) (`actual`) not equal to 121333L (`expected`).
`actual`: 121331
`expected`: 121333
── Failure ('test-vocab-corpus.R:113'): bi-gram Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 220104L (`expected`).
`actual`: 220109
`expected`: 220104
── Failure ('test-vocab-corpus.R:121'): Unigram + Bigram Vocabulary Corpus construction ──
length(vocab$term) (`actual`) not equal to 140630L (`expected`).
`actual`: 140620
`expected`: 140630
── Failure ('test-vocab-corpus.R:128'): Unigram + Bigram Vocabulary Corpus construction ──
length(m@x) (`actual`) not equal to 361818L (`expected`).
`actual`: 361822
`expected`: 361818
[ FAIL 12 | WARN 20 | SKIP 1 | PASS 132 ]
```
With such high numbers of vocabulary elements it's harder to pin down exactly what changed, but I do see things like e-mail addresses and words containing colons, which were mentioned in the tokenizers issue as causing slight changes in stringi::stri_split_boundaries behavior, so I am reasonably confident that's the full extent of what's going on.
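For anyone wanting to compare the two ICU versions directly, here is a sketch of the kind of input involved. The example string is illustrative (not taken from the test corpus), and the resulting splits differ depending on which ICU version stringi is linked against, which is why no expected output is shown:

```r
library(stringi)

# Word-boundary tokenization of text containing an e-mail address and a
# colon-adjacent word -- the contexts flagged in the tokenizers issue.
# skip_word_none drops the non-word boundary segments (spaces, punctuation),
# mirroring how word tokenizers typically call this function.
x <- "contact jane.doe@example.com re: pricing"
stri_split_boundaries(x, type = "word", skip_word_none = TRUE)
```

Running this under stringi builds linked against ICU 69 vs ICU 72 and diffing the token lists should reproduce the off-by-a-few counts seen in the test failures above.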
We could simply update the expected token counts to match the new output, but that is technically a breaking change, and it may be hard to anticipate how it would affect downstream model outputs, so I leave it to the maintainers to decide the right path forward.
Hey! Thanks for this ticket. We already had failing tests in 0.6.1 and I wasn't able to find the root cause; apparently it was the ICU version change.