elixirschool / school_house

The new era of Elixir School now powered by @phoenixframework
https://elixirschool.com/
Apache License 2.0
152 stars 48 forks source link

Table of contents fails on international characters #111

Closed jaimeiniesta closed 3 years ago

jaimeiniesta commented 3 years ago

Detected on https://rocketvalidator.com/s/4ec0b390-f775-4db1-87a4-ced8f5ece619/p/4686420/i

It looks like the tables of contents for lessons does not generate the headings when those contain non-English characters.

While in the legacy site it works fine:

https://elixirschool.com/es/lessons/specifics/mnesia/#inicialización-de-datos-y-migración

the beta site doesn't work OK:

https://beta.elixirschool.com/es/storage/mnesia

Captura de pantalla 2021-07-29 a las 13 22 03

Headings containing non-English characters like "inicialización" are not being generated.

kinson commented 3 years ago

I took a look at this and the culprit appears to be the @headers_regex in the SchoolHouse.Content.Lesson module. The regex scans the body of the lesson and finds header tags to construct the TOC. Right now, the regex does not include accented characters in the main capture group.

I am not familiar with capturing accented characters in regex so I went to Stack Overflow of course and found this suggestion for matching all characters (accented and unaccented): A-Za-zÀ-ÖØ-öø-ÿ . If we update the regex used in the Lesson module, it would look like:

# current regex
@headers_regex ~r/<(h\d)>(["\w\s?!\.\/\d]+)(?=<\/\1>)/i

# proposed regex
@headers_regex ~r/<(h\d)>(["A-Za-zÀ-ÖØ-öø-ÿ\s?!\.\/\d]+)(?=<\/\1>)/i

I can go ahead and create a pr for this and add tests for this use case specifically, but before I do I want to see if there are any suggestions or improvements we can make to this regex update.

jaimeiniesta commented 3 years ago

Sounds fine, here are some accented character I can think of, to include in your tests:

áéíóú
àèìòù
âêîôû
äëïöü
ñ ç

and in uppercase:

ÁÉÍÓÚ
ÀÈÌÒÙ
ÂÊÎÔÛ
ÄËÏÖÜ
Ñ Ç
jaimeiniesta commented 3 years ago

Also, found this - there's a regex match for international characters!

https://davidwalsh.name/regex-accented-letters

kinson commented 3 years ago

Thanks for sharing that link, that's a much better approach. I tried using \p{Letter} and got an error but I was able to find the solution pretty quickly: \p{L}