JuliaText / TextAnalysis.jl

Julia package for text analysis

PCRE compilation error: regular expression is too large at offset 592769 #258

Open jfb-h opened 2 years ago

jfb-h commented 2 years ago

Upon trying to remove sparse terms from a corpus via

remove_sparse_terms!(corp, .05)

I run into the following error message:

PCRE compilation error: regular expression is too large at offset 592769

    compile(::String, ::UInt32)@pcre.jl:128
    Regex(::String, ::UInt32, ::UInt32)@regex.jl:44
    _build_regex(::Languages.English, ::UInt32, ::Set{AbstractString}, ::Set{AbstractString})@preprocessing.jl:542
    var"#prepare!#14"(::Set{AbstractString}, ::Set{AbstractString}, ::typeof(TextAnalysis.prepare!), ::TextAnalysis.Corpus{TextAnalysis.StringDocument{String}}, ::UInt32)@preprocessing.jl:414
    remove_sparse_terms!(::TextAnalysis.Corpus{TextAnalysis.StringDocument{String}}, ::Float64)@preprocessing.jl:341
    top-level scope@Local: 18

Is this a bug, or could it just mean that something is wrong with one of the documents? The latter seems possible, since I'm dealing with patents, which can get pretty messy.

I'm on Julia 1.6.1 and TextAnalysis v0.7.3.
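
The trace suggests that `_build_regex` compiles a single pattern from the whole set of terms to remove, so a large number of sparse terms could plausibly exceed PCRE's size limit regardless of how messy any one document is. As a possible workaround, here is a minimal sketch that avoids building a large pattern at all: it matches every word with a small fixed regex and checks it against a `Set`. The function name `remove_terms` is hypothetical and not part of TextAnalysis.jl, and it assumes the terms are single case-insensitive word tokens.

```julia
# Hypothetical helper, not part of TextAnalysis.jl.
# Removes every word of `text` that appears in `terms` (case-insensitive),
# using one small fixed regex plus a Set lookup instead of one huge alternation.
function remove_terms(text::AbstractString, terms)
    # Lowercase the terms once so each lookup is a cheap Set membership test.
    drop = Set(lowercase(String(t)) for t in terms)
    # The function is called with each matched word; returning "" deletes it,
    # returning the word itself leaves it unchanged.
    return replace(text, r"\w+" => w -> lowercase(w) in drop ? "" : w)
end

cleaned = remove_terms("the Quick brown fox", ["quick", "fox"])
println(repr(cleaned))
```

Removed words leave their surrounding whitespace behind, so a whitespace cleanup pass may still be needed afterwards. Terms containing hyphens or spaces would not be matched by `\w+` and would need a different tokenization.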

aviks commented 2 years ago

Might well be a bug. Are the documents you use public? If so, would you be able to provide an example that fails?