No support for Spanish é, ç and other letters.

codigomaye commented 6 months ago

Hey dear MommaWatasu,

I hope you can help me on this one please :sweat_smile:.

The problem

When I try to insert Spanish text in a <p> HTML tag, I get an error. Example:

<p>Desde el corazón de Jerusalén, pasando por Galilea hasta llegar al desierto. Cada paso que des te hara sentir un verdadero discípulo de Cristo
</p>

The error:

│   exception =
│    BoundsError: attempt to access 19-element Vector{Union{AbstractString, Symbol}} at index [20]

Further explanations:

Julia uses "byte" indexing for strings, instead of "character" indexing.

Example: The text "OteraEngine" has the following indices:

[1] => O
[2] => t
[3] => e
[4] => r
[5] => a
[6] => E
[7] => n
[8] => g
[9] => i
[10] => n
[11] => e

The index of t is the index of O plus the length of O in bytes (1 + 1 = 2), because each of these characters takes up 1 byte. (So we can consider each letter of the English alphabet to be 1 byte long.)

However, Spanish (and French, and many other languages) has letters that are 2 bytes long in UTF-8, including ñ, é, ç, and others.

Example 2: The text "España" (which means Spain) has the following indices:

[1] => E
[2] => s
[3] => p
[4] => a
[5] => ñ (2 bytes: 5 and 6)
[7] => a

So, when I walk the string with a counter bounded by the length() function, I get an error: length() returns 6 (characters), but the valid indices are 1, 2, 3, 4, 5 and 7. Index 6 falls in the middle of ñ, so text[6] is not a valid character index. (This is how I understand it, I think there is a better way to explain it though)
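
A minimal sketch for checking this in the REPL, using only Base functions. It assumes the string literal contains the precomposed ñ (U+00F1), which is 2 bytes in UTF-8; a decomposed n + combining tilde would give different numbers.

```julia
txt = "España"   # assumption: precomposed ñ (U+00F1), 2 bytes in UTF-8

n_chars = length(txt)               # number of characters
n_bytes = ncodeunits(txt)           # number of bytes (code units)
valid   = collect(eachindex(txt))   # byte indices where a character starts

println(n_chars)            # 6
println(n_bytes)            # 7
println(valid)              # [1, 2, 3, 4, 5, 7]
println(isvalid(txt, 6))    # false: byte 6 is the second byte of ñ
```

The gap between 5 and 7 in the list of valid indices is exactly where a counter incremented by 1 goes wrong.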

Solution

Replace while i <= length(txt) with for i in eachindex(txt). This ensures that you get a valid character index on every iteration, which enables OteraEngine to parse the characters of these languages. (I tried it and it worked!)

codigomaye commented 6 months ago

I just added the error description :sweat_smile:

codigomaye commented 6 months ago

I found the solution to the problem:

Julia uses "byte" indexing for strings, instead of "character" indexing.

Solution

Replace while i <= length(txt) with for i in eachindex(txt). This ensures that you get a valid character index on every iteration, which enables OteraEngine to parse the characters of these languages. (I tried it and it worked!)

MommaWatasu commented 6 months ago

Hi @codigomaye

I think the bug is the same as this one, and it was fixed in v0.5.1. If you are using another version, please update the package and try again.

codigomaye commented 6 months ago

> Hi @codigomaye
>
> I think the bug is the same as this one, and it was fixed in v0.5.1. If you are using another version, please update the package and try again.

Hey @MommaWatasu, Thanks for your reply.

I get the error with both v0.5.1 and v0.5.0 (I tried both out of curiosity).

Steps to reproduce:

1. Install the latest OteraEngine version (Pkg.add("OteraEngine") installs v0.5.1 by default).
2. Try to generate a template from the following text:

using OteraEngine

txt = "élena"
tmp = Template(txt, path = false)

This leads to a BoundsError, because of the indexing problem explained before. Now, a brief demonstration of why this happens:

julia> txt = "élena"
julia> i = 1
julia> while i <= length(txt)
           println(txt[i])  # throws StringIndexError at i = 2, inside the 2-byte é
           i = i + 1
       end

This example reproduces the while loop used in the tokenizer() function inside parser.jl, which can't parse the text either. As soon as the counter lands inside a 2-byte character (here é), the index is invalid. Because of how Julia strings are designed, you can't walk a string with length and i++ style indexing the way you can in other programming languages such as JavaScript.

Now, if we replace the while loop and length with a for loop and eachindex:

julia> txt = "élena"
julia> for i in eachindex(txt)
           println(txt[i])  # i takes the values 1, 3, 4, 5, 6
       end

We get the desired result without affecting the logic of the code :+1: .
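
The two loops above can also be compared side by side in a self-contained script. This is only a sketch: scan_by_count and scan_by_index are hypothetical helper names made up for illustration, not functions from OteraEngine.

```julia
# Hypothetical helpers, for illustration only; not OteraEngine code.

# Counter-based scan: treats 1:length(txt) as valid indices,
# which only holds for pure ASCII text.
function scan_by_count(txt::AbstractString)
    chars = Char[]
    i = 1
    while i <= length(txt)
        push!(chars, txt[i])   # throws StringIndexError if i is inside a multi-byte character
        i += 1
    end
    return chars
end

# Index-based scan: eachindex yields only the byte indices where a character starts.
function scan_by_index(txt::AbstractString)
    chars = Char[]
    for i in eachindex(txt)
        push!(chars, txt[i])
    end
    return chars
end

txt = "élena"

# Record whether the counter-based scan fails with a StringIndexError.
failed = try
    scan_by_count(txt)
    false
catch err
    err isa StringIndexError
end

println(failed)               # true
println(scan_by_index(txt))   # ['é', 'l', 'e', 'n', 'a']
```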

codigomaye commented 6 months ago

I rectified the while example above :+1:

codigomaye commented 6 months ago

> Hi @codigomaye
>
> I think the bug is the same as this one, and it was fixed in v0.5.1. If you are using another version, please update the package and try again.

I checked this issue right now.

More or less the same problem as in v0.5.0.

v0.5.1 solves it for characters like à, but not for others like ñ and é, because of how nextind works. I read through the whole of this answer post on Julia Discourse to figure out the problem.

The solution in v0.5.1 is still incomplete: I couldn't render a Spanish web page until I ran dev OteraEngine and tweaked the tokenizer. Then it happily worked ☺️
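
For reference, a minimal sketch of a while loop that keeps an explicit index and still handles multi-byte characters. It relies on two things: the loop bound is lastindex(txt) (a byte index) rather than length(txt) (a character count), and every advance goes through nextind. collect_chars is a hypothetical name used only for illustration; this is not the actual OteraEngine tokenizer, and it is not a claim about what v0.5.1 does.

```julia
# Hypothetical helper, for illustration only; not the actual OteraEngine tokenizer.
function collect_chars(txt::AbstractString)
    chars = Char[]
    i = firstindex(txt)
    while i <= lastindex(txt)    # lastindex is the byte index of the last character
        push!(chars, txt[i])
        i = nextind(txt, i)      # jump to the start of the next character
    end
    return chars
end

println(collect_chars("España"))       # ['E', 's', 'p', 'a', 'ñ', 'a']
println(collect_chars("こんにちは"))    # ['こ', 'ん', 'に', 'ち', 'は']
```

A loop shaped like this can be handy when a parser needs to skip ahead by more than one character at a time, which a plain for loop over eachindex does not allow.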

MommaWatasu commented 6 months ago

Thanks for your great effort! I fixed the tokenizer function and added some tests for Japanese and Spanish text. But if nextind is an incomplete solution for this issue, as you mentioned, the current code may still have the problem. Could you check the code in the master branch and tell me whether it still has the bug? If you don't reply by the next day, I'll release this code as v0.5.2.

codigomaye commented 6 months ago

Hey @MommaWatasu, I will check it right away :+1:

codigomaye commented 6 months ago

Hey @MommaWatasu, now it works perfectly :+1: .

This is a screen capture of the rendered page, to make you feel happy for the great job you did :smile:

Screenshot

Have a nice day :smile:

MommaWatasu commented 6 months ago

Thanks for the screenshot, I'm glad to see how OteraEngine is used! Now v0.5.2 is available. Please update and enjoy using it!

MommaWatasu / OteraEngine.jl