fix header regex not recognizing bengali, tamil

thth commented 3 years ago

Tables of contents weren't being generated for Bengali or Tamil because the regex for parsing headers didn't account for characters in the Unicode general category for marks (\p{M}).
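For illustration, here is a minimal sketch of the failure. It assumes the previous pattern was the same as the one in this PR minus `\p{M}`; that "before" pattern is a guess, not copied from the repository. The sample header is the Tamil word for "Tamil", written with escapes so the code points are visible.

```elixir
# Hypothetical "before" pattern: the PR's regex with \p{M} removed (an assumption).
without_marks = ~r/<(h\d)>(["\p{L}\s?!\.\/\d]+)(?=<\/\1>)/iu

# The pattern quoted in this PR.
with_marks = ~r/<(h\d)>(["\p{L}\p{M}\s?!\.\/\d]+)(?=<\/\1>)/iu

# U+0BA4, U+0BAE, U+0BB4 are letters (category L).
# U+0BBF (vowel sign I) and U+0BCD (virama) are marks (category M).
title = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"
html = "<h2>" <> title <> "</h2>"

# Without \p{M} the character class stops at the first mark, so the
# lookahead for </h2> never succeeds and there is no match.
IO.inspect(Regex.run(without_marks, html))

# With \p{M} the whole header text is captured.
IO.inspect(Regex.run(with_marks, html))
```

The first call returns `nil`, while the second returns the match along with the tag name and the header text.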

I don't know Bengali, Tamil, or Unicode, so I have no idea why these languages have characters distinguished as marks but not others 🤷

I'm also not sure why the regex for recognizing headers, now ~r/<(h\d)>(["\p{L}\p{M}\s?!\.\/\d]+)(?=<\/\1>)/iu, is so specific. Perhaps it would be good to generalize it some? Currently it wouldn't catch numerals other than the digits matched by \d, symbols, or punctuation beyond the few characters listed (the other Unicode general categories).
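As a rough sketch of what generalizing could look like: the `general` pattern below is hypothetical and not part of this PR. It accepts any run of characters other than `<` instead of listing categories. Adding `\p{S}` to the class would be riskier, because `<` and `>` are themselves symbols and a greedy match could run across tags.

```elixir
# The pattern quoted in this PR.
current = ~r/<(h\d)>(["\p{L}\p{M}\s?!\.\/\d]+)(?=<\/\1>)/iu

# Hypothetical generalization (an assumption, not from the PR):
# accept any run of non-"<" characters as the header text.
general = ~r/<(h\d)>([^<]+)(?=<\/\1>)/iu

html = "<h2>Step 1: Set-up</h2>"

# ":" and "-" are punctuation outside the listed characters,
# so the current pattern finds no match.
IO.inspect(Regex.run(current, html))

# The generalized pattern captures the full header text.
IO.inspect(Regex.run(general, html))
```

One trade-off: `[^<]+` stops at the first nested tag, so a header containing inline markup (for example a `<code>` element) would still be skipped. Whether the generated headers ever contain such markup is not something this thread settles.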

kinson commented 3 years ago

Thanks @thth for finding and reporting this issue (on top of the other 3)! I will take a look at your remaining PRs this weekend 🙌🏼

thth commented 3 years ago

Thanks! By the way, the hacktoberfest-accepted tags are spelt wrong 😅

kinson commented 3 years ago

> Thanks! By the way, the hacktoberfest-accepted tags are spelt wrong 😅

Whoops! Thanks for raising that. I just went through and updated the tags by adding a new hacktoberfest-accepted label and then deleting the hacktoberest-accepted label!

elixirschool / school_house

fix header regex not recognizing bengali, tamil #164