JuliaText / TextAnalysis.jl

Julia package for text analysis

remove_corrupt_utf8() not working #41

Closed abieler closed 11 months ago

abieler commented 8 years ago

The function remove_corrupt_utf8() does not work under Julia v0.4.6. The problem is the line zeros(Char, endof(s)+1) where it complains that zero is not defined for type Char. When using UInt8 instead I could make it run without error, but please check if this does what it is supposed to do.

function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 0
    for chr in s
        i += 1
        r[i] = (chr != 0xfffd) ? chr : ' '
    end
    return utf8(r)
end

Note that on the return statement I got rid of the CharString() too.

If this is ok I can make another pull request.

Cheers, Andre

aviks commented 8 years ago

Sure, thanks. Looks OK. Note that utf8 is deprecated in 0.5; you'll need to use Compat.UTF8String. I've just fixed all the other deprecations on 0.5.

abieler commented 8 years ago

In 0.5 I had to adapt further, because chr != 0xfffd is deprecated. However, UInt8(chr) != 0xfffd throws an InexactError() if the character does not fit in a UInt8, so I wrapped it in a try-catch.

I was also not sure whether stepping the index with i += 1 was OK before, so I switched to nextind(s,i).

function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 1
    for chr in s
        try
            r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '
        catch
            r[i] = ' '
        end
        i = nextind(s,i)
    end
    return Compat.UTF8String(r)
end

Seems reasonable?

aviks commented 8 years ago

r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '

Not all Unicode characters will fit in a UInt8. The line above will lose all non-ASCII characters from the string, I think.
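
To make the point above concrete, here is a small sketch, assuming Julia 1.x (behaviour on 0.4/0.5 may have differed). The helper name `fits_in_byte` is hypothetical and only serves the illustration. It shows that a `UInt8` can never equal `0xfffd`, and that code points above U+00FF cannot be converted at all.

```julia
# Hypothetical helper, not part of TextAnalysis.jl: does the code point fit in one byte?
fits_in_byte(c::Char) = UInt32(c) <= 0xff

fits_in_byte('a')        # true
fits_in_byte('é')        # true (U+00E9), but the lone byte 0xe9 is not valid UTF-8
fits_in_byte('€')        # false (U+20AC): UInt8('€') throws an InexactError
fits_in_byte('\ufffd')   # false: U+FFFD itself always ends up in the catch branch

# A UInt8 is at most 0xff, so `UInt8(chr) != 0xfffd` is true whenever the conversion succeeds.
typemax(UInt8) < 0xfffd  # true
```

Under these assumptions, the try-catch version would replace every character above U+00FF with a space, not only the corrupt ones.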

I'd use something like this:

function remove_corrupt_utf8(s::AbstractString)
    r = IOBuffer()
    for chr in s
        if chr != Char(0xfffd)
            write(r, chr)
        end
    end
    return takebuf_string(r)
end

Are there any tests for this?
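
As a starting point for such tests, here is a minimal sketch, assuming Julia 1.x and the `Test` standard library. The function name `drop_replacement_chars` is hypothetical. It restates the IOBuffer approach above with `String(take!(io))`, which replaced `takebuf_string` in later Julia versions. It is not the package's actual test suite.

```julia
using Test

# Hypothetical stand-in for the IOBuffer version: drop every U+FFFD character.
function drop_replacement_chars(s::AbstractString)
    io = IOBuffer()
    for chr in s
        chr != '\ufffd' && write(io, chr)   # write(io, ::Char) emits the UTF-8 bytes
    end
    return String(take!(io))
end

@testset "drop_replacement_chars" begin
    @test drop_replacement_chars("abc") == "abc"          # plain ASCII is untouched
    @test drop_replacement_chars("x\ufffdy") == "xy"      # U+FFFD is removed
    @test drop_replacement_chars("naïve €") == "naïve €"  # other non-ASCII text survives
    @test drop_replacement_chars("") == ""                # empty input stays empty
end
```

The third case is the one the `UInt8`-based version would fail, so it seems worth keeping in any real test.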

mirestrepo commented 7 years ago

Are there any updates/resolutions on this?

rssdev10 commented 11 months ago

This should work with Julia > 1.0 and an implementation like:

function remove_corrupt_utf8(s::AbstractString)
    return map(x->isvalid(x) ? x : ' ', s)
end
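
One difference from the earlier versions may be worth noting: `isvalid` flags malformed byte sequences, whereas the earlier code compared against U+FFFD, which is itself a valid code point. The sketch below, assuming Julia 1.x, illustrates this. `replace_invalid` is a hypothetical name that mirrors the one-liner above.

```julia
# Hypothetical name; same body as the one-liner above.
replace_invalid(s::AbstractString) = map(c -> isvalid(c) ? c : ' ', s)

# A string holding a stray 0xff byte, which is never valid in UTF-8.
raw = String(UInt8[0x61, 0xff, 0x62])

replace_invalid(raw)          # "a b": the malformed byte becomes a space
replace_invalid("x\ufffdy")   # unchanged: U+FFFD passes isvalid
```

If the input was already decoded elsewhere, with U+FFFD inserted in place of the bad bytes, the predicate would presumably also need a `c == '\ufffd'` check to get the original behaviour.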