JuliaText / TextAnalysis.jl

Julia package for text analysis

StringIndexError when trying to create a StringDocument based on a UTF8 string #255

Closed alexzandros closed 8 months ago

alexzandros commented 3 years ago

I'm trying to create a StringDocument from a string that contains non-ASCII (multi-byte UTF-8) characters, and all I'm getting is a StringIndexError.

My code is as follows:

using TextAnalysis
str = "Lo que tengamos que hacer, apoyar, enteegar el ❤️ y el alma por nuestro país. Ivan es el Man. 👏👏👏#Duquepresidente https://t.co/Dr1LdTa5yQ"
sd = StringDocument(str)

And I get the following error:

Error showing value of type StringDocument{String}:
ERROR: StringIndexError: invalid index [50], valid nearby indices [48]=>'❤', [51]=>'️'

Followed by a stack trace.

So I need to know the best practice for working with UTF-8 strings.

Thanks in advance.
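For context on the error message itself, here is a minimal sketch in plain Julia (no TextAnalysis calls) of why index 50 is rejected. It assumes a shortened copy of the string above, with the heart and its variation selector written as `\u2764\ufe0f` escapes so the bytes are unambiguous. Julia `String` indices are byte offsets, and only the first byte of each character is a valid index.

```julia
# Shortened copy of the reported string; \u2764 is ❤ and \ufe0f is the
# variation selector that follows it. Each takes 3 bytes in UTF-8.
s = "Lo que tengamos que hacer, apoyar, enteegar el \u2764\ufe0f y el alma"

n_chars = length(s)       # number of characters
n_bytes = ncodeunits(s)   # number of bytes; larger than n_chars here

# ❤ occupies bytes 48, 49 and 50, but only 48 is a valid index.
valid_48 = isvalid(s, 48)
valid_50 = isvalid(s, 50)

# Move between characters with nextind/prevind rather than adding or subtracting 1.
next_after_heart = nextind(s, 48)

# thisind maps any in-bounds byte offset back to the start of its character.
heart_start = thisind(s, 50)

# eachindex yields only valid indices, so s[i] cannot throw inside this loop.
chars = [s[i] for i in eachindex(s)]
```

Code that computes positions as `i + 1` or `i - 1`, or that treats `length(s)` as the last index, works on ASCII text and fails on strings like this one.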

aviks commented 3 years ago

Can you paste the stack trace you saw? Looks like a bug on our side.

segunolulana commented 2 years ago

I also experienced the same issue. The text in question contains "Less likely working with code I don’t like", and the stack trace is:

ERROR: LoadError: StringIndexError: invalid index [38], valid nearby indices [36]=>'’', [39]=>'t'
Stacktrace:
  [1] string_index_err(s::String, i::Int64)
    @ Base ./strings/string.jl:12
  [2] SubString{String}(s::String, i::Int64, j::Int64)
    @ Base ./strings/substring.jl:32
  [3] SubString
    @ ./strings/substring.jl:38 [inlined]
  [4] SubString
    @ ./strings/substring.jl:44 [inlined]
  [5] remove_patterns(s::SubString{String}, rex::Regex)
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:486
  [6] remove_patterns!
    @ ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:508 [inlined]
  [7] remove_patterns!(crps::Corpus{StringDocument{SubString{String}}}, rex::Regex)
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:534
  [8] prepare!(crps::Corpus{StringDocument{SubString{String}}}, flags::UInt32; skip_patterns::Set{AbstractString}, skip_words::Set{AbstractString})
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:415
  [9] prepare!
    @ ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:406 [inlined]
 [10] summarize(d::StringDocument{String}; ns::Int64)
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/summarizer.jl:22
 [11] main()...
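The trace shows the exception being raised inside the `SubString` constructor. The source of `remove_patterns` is not shown in this thread, so the following is only a sketch of how a slice at byte offset 38 fails on this text and how plain Base functions avoid it. `safe_prefix` is a hypothetical helper, not part of TextAnalysis, and the apostrophe is written as `\u2019`.

```julia
# Text from the report above; \u2019 is the typographic apostrophe (3 bytes in UTF-8).
s = "Less likely working with code I don\u2019t like"

# Slicing up to the raw offset from the error message raises StringIndexError,
# because byte 38 is the last byte of the apostrophe, not the start of a character.
threw = try
    SubString(s, 1, 38)
    false
catch err
    err isa StringIndexError
end

# Hypothetical helper: snap the end offset to a valid index before slicing.
safe_prefix(str::AbstractString, offset::Integer) =
    SubString(str, 1, thisind(str, offset))

prefix = safe_prefix(s, 38)   # ends with the apostrophe

# Base.replace does the index bookkeeping itself, so no manual offsets are needed.
cleaned = replace(s, r"[^\w\s]" => "")
```

Whether the package's own code needs a fix of this shape depends on the installed version; the sketch only illustrates the indexing rule behind the error.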

rssdev10 commented 8 months ago

Not reproducible with Julia 1.9 and TextAnalysis 0.8.