Closed codigomaye closed 6 months ago
I just added the error description :sweat_smile:
I found the solution to the problem:
Julia uses "byte" indexing for characters, instead of "character" indexing.
Example: The text "OteraEngine" has the following index:
[1] => O
[2] => t
[3] => e
[4] => r
[5] => a
[6] => E
[7] => n
[8] => g
[9] => i
[10] => n
[11] => e
The index of character t is the index of character O plus the byte length of O (1 + 1 = 2), because each of these characters occupies 1 byte. (So we can consider that each letter of the English alphabet is 1 byte long.)
However, Spanish, French and many other languages have letters that are 2 bytes long, including ñ, é, ç, and others.
Example 2: The text "España" (which means Spain) has the following index:
[1] => E
[2] => s
[3] => p
[4] => a
[5] => ñ (first byte)
[6] => ñ (second byte, not a valid index by itself)
[7] => a
So, when I step through the indices using the length() function, I run into an error, because length() counts characters (6 here) while the indices count bytes, and text[5] and text[6] can't be separated: index 6 falls in the middle of ñ. (This is how I understand it, I think there is a better way to explain it though)
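The byte layout described above can be checked directly in Julia. This is a minimal sketch, assuming ñ is stored as the single code point U+00F1 (written with the \u00f1 escape so the result does not depend on how the file is saved); the variable names are only for illustration.

```julia
# "España", with ñ written as an escape (assumption: precomposed U+00F1).
text = "Espa\u00f1a"

n_chars = length(text)               # number of characters: 6
n_bytes = ncodeunits(text)           # number of UTF-8 bytes: 7
valid   = collect(eachindex(text))   # valid indices: [1, 2, 3, 4, 5, 7]

println("characters:    ", n_chars)
println("bytes:         ", n_bytes)
println("valid indices: ", valid)

# Index 5 starts ñ; index 6 is its second byte and cannot be used on its own.
println(isvalid(text, 5))   # true
println(isvalid(text, 6))   # false
```

Since length(text) is 6 while the last valid index is 7, a counter that runs from 1 to length(text) both hits the invalid index 6 and never reaches the final a.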
Replace while i <= length(txt) with for i in eachindex(txt). This ensures that you get a valid character index each time, which enables OteraEngine to parse the characters of these languages. (I tried it and it worked!)
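As a rough illustration of the eachindex approach, the sketch below collects every valid index together with its character. The helper name index_char_pairs is hypothetical and not part of OteraEngine; the input again assumes ñ is the single code point U+00F1.

```julia
# Hypothetical helper (not OteraEngine code): visits only the valid
# indices of the string and records each index with its character.
function index_char_pairs(txt::AbstractString)
    out = Tuple{Int,Char}[]
    for i in eachindex(txt)
        push!(out, (i, txt[i]))
    end
    return out
end

# Prints one line per character; index 6 is skipped because it lies inside ñ.
for (i, c) in index_char_pairs("Espa\u00f1a")
    println("[", i, "] => ", c)
end
```

The loop body still receives an index i, so code that slices or compares positions can stay as it is; only the way i advances changes.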
Hi @codigomaye
I think this is the same bug as this one, which was fixed in v0.5.1. If you are using another version, please update the package and try again.
Hey @MommaWatasu, thanks for your reply.
The error occurs in both v0.5.1 and v0.5.0 (I tried both out of curiosity; Pkg.add("OteraEngine") installs v0.5.1 by default).
txt = "élena"
tmp = Template(txt, path = false)
This leads to a BoundsError, because of the indexing problem explained before. Now, a brief demonstration of why this happens:
julia> txt = "élena"
julia> i = 1
julia> while i <= length(txt)   # length counts characters (5), not bytes (6)
           println(txt[i])      # fails at i = 2, which lies inside the 2-byte é
           i = i + 1
       end
This example reproduces the while loop used in the tokenizer() function inside parser.jl, which can't parse the text either: at some point it reaches a 2-byte character (é), whose second byte is not a valid index. Stepping with length and i++, as you would in other programming languages such as JavaScript, does not work here because of how Julia strings are designed.
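To see where the counter-based loop breaks, the sketch below wraps the same loop in a function and reports the first index that fails. The name first_failure is hypothetical, and é is assumed to be the single code point U+00E9. In plain Julia the failing access raises a StringIndexError; the error that OteraEngine itself reports may differ.

```julia
# Hypothetical helper: runs the counter-based loop and returns the first
# index whose access throws, together with the exception type.
function first_failure(txt::AbstractString)
    i = 1
    while i <= length(txt)
        try
            txt[i]
        catch err
            return (i, typeof(err))
        end
        i += 1
    end
    return nothing   # every access succeeded
end

println(first_failure("\u00e9lena"))   # (2, StringIndexError)
println(first_failure("elena"))        # nothing
```

With pure ASCII input the loop never fails, which is why the problem only shows up once accented text is rendered.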
Now, if we replace the while loop and length with a for loop and eachindex:
julia> txt = "élena"
julia> for i in eachindex(txt)   # eachindex yields only valid character indices
           println(txt[i])
       end
We get the desired result without affecting the logic of the code :+1:.
I rectified the while example above :+1:
I checked this issue just now.
It is more or less the same problem as in v0.5.0.
v0.5.1 solves it for characters like à, but not for others like ñ and é, because of how nextind works. To figure out the problem I read this answer post on Julia Discourse, going through the entire thread until I reached the answer.
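For reference, nextind(txt, i) returns the next valid index after i, so it steps over multi-byte characters correctly on its own. One general pitfall, shown here only as an illustration and not as a description of the v0.5.1 source, is pairing nextind with a loop bound of length(txt) instead of lastindex(txt). The helper name walk is hypothetical.

```julia
# Hypothetical helper: walks a string with nextind and stops once the
# index passes `stop`, returning the characters that were visited.
function walk(txt::AbstractString, stop::Integer)
    out = Char[]
    i = firstindex(txt)
    while i <= stop
        push!(out, txt[i])
        i = nextind(txt, i)
    end
    return join(out)
end

s = "Espa\u00f1a"                 # "España", ñ assumed to be U+00F1
println(walk(s, lastindex(s)))    # España
println(walk(s, length(s)))       # Españ (the final a is never reached)
```

With lastindex as the bound the loop visits every character; with length it stops one character early here, because the last valid index (7) is larger than the character count (6).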
The solution in v0.5.1 is still incomplete: I couldn't render a Spanish web page until I ran dev OteraEngine and tweaked the tokenizer. Then it happily worked ☺️
Thanks for your great effort!
I fixed the tokenizer function and added some tests for Japanese and Spanish text. But if nextind is an incomplete solution for this issue, as you mentioned, the current code may still have the problem.
Could you check the code in the master branch and tell me whether it still has the bug? If you don't reply by tomorrow, I'll release this code as v0.5.2.
Hey @MommaWatasu, I will check it right away :+1:
Hey @MommaWatasu, now it works perfectly :+1:.
This is a screen capture of the rendered page, to make you feel happy for the great job you did :smile:
Have a nice day :smile:
Thanks for the screenshot, I'm glad to see how OteraEngine is used! Now v0.5.2 is available. Please update and enjoy using it!
Hey dear MommaWatasu,
I hope you can help me with this one, please :sweat_smile:.
The problem
When I try to insert Spanish text in a p HTML tag, I get an error. Ex:
The error: