davraamides / todotxt-mode

MIT License

Allow projects and contexts to contain non-Latin characters #31

Closed dehero closed 3 years ago

dehero commented 3 years ago

Hello.

The \b token at the end of the regex search pattern prevented Cyrillic and other non-Latin project and context names from being found and highlighted. Removing it solves the problem, although the regex can now match more characters than originally intended. I think the trade-off is worth it.
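To illustrate the behaviour described above, here is a minimal sketch. The two patterns are hypothetical stand-ins, not the extension's actual patterns. In JavaScript regular expressions, \b is defined in terms of the ASCII word characters [A-Za-z0-9_], so no word boundary exists between a Cyrillic letter and the whitespace that follows it.

```typescript
// Hypothetical patterns for illustration only; the real todotxt-mode
// patterns may differ.
const withBoundary = /\+\S+\b/g; // project pattern ending in \b
const withoutBoundary = /\+\S+/g; // same pattern with \b removed

const line = "Купить молоко +покупки +shopping";

// \b needs an ASCII word character on exactly one side. After "+покупки"
// both neighbours (a Cyrillic letter and a space) are non-word characters,
// so the pattern with \b cannot match there.
const matchesWithBoundary: string[] = line.match(withBoundary) || [];
const matchesWithoutBoundary: string[] = line.match(withoutBoundary) || [];

console.log(matchesWithBoundary); // ["+shopping"]
console.log(matchesWithoutBoundary); // ["+покупки", "+shopping"]
```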

Before: image

After: image

davraamides commented 3 years ago

Thanks, @dehero. Forgive my ignorance with Cyrillic languages! Before I accept your pull request, can you test tags too (e.g. tag:value)? Those regex patterns also begin and end with the \b token, and I'd like to fix them at the same time if needed. As I'm looking at my code, I can't remember why I think I needed the word boundary in the pattern but there may be some edge cases I need to consider.

dehero commented 3 years ago

I've now fixed tags too.

Before: image

After: image

> Forgive my ignorance with Cyrillic languages!

Better to say that the Cyrillic languages were initially ignored by the developers of regular expressions.

> I'm looking at my code, I can't remember why I think I needed the word boundary in the pattern but there may be some edge cases I need to consider.

Regarding the boundary tokens: by removing them, we allow not only non-Latin letters but also any other non-whitespace characters, so these become valid:

+pro,ject, // project
@c*(n)tex: // context
\:$        // tag
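The three examples above can be checked with a short sketch. The patterns and the `firstMatch` helper are assumptions made for illustration (a prefix or colon surrounded by runs of non-whitespace); they are not taken from the extension's source.

```typescript
// Hypothetical boundary-free patterns, assumed for illustration.
const projectPattern = /\+\S+/;
const contextPattern = /@\S+/;
const tagPattern = /\S+:\S+/;

// Returns the first matched substring, or null when nothing matches.
function firstMatch(pattern: RegExp, text: string): string | null {
  const result = text.match(pattern);
  return result === null ? null : result[0];
}

// Punctuation is swallowed because \S accepts any non-whitespace character.
console.log(firstMatch(projectPattern, "+pro,ject,")); // "+pro,ject,"
console.log(firstMatch(contextPattern, "@c*(n)tex:")); // "@c*(n)tex:"
console.log(firstMatch(tagPattern, "\\:$")); // "\:$"
```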

Although the todo.txt format has some sort of specification, I cannot find any details in it on which symbols are allowed or disallowed. Each editor or highlighter acts on its own.

In general, I think that todotxt-mode's token parsing needs some more refinement for non-letter symbols. But for now we are just fixing a more significant issue: non-Latin letters should obviously be allowed.
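As one possible direction for that later refinement (not what this pull request does), a Unicode-aware pattern could accept letters from any script while still stopping at punctuation. The sketch below assumes that project names consist only of letters, digits and underscores; that character set is a guess, since the format does not define one. `findProjects` is a hypothetical helper.

```typescript
// A sketch only: the allowed character set is an assumption.
function findProjects(line: string): string[] {
  // \p{L} = any Unicode letter, \p{N} = any Unicode number.
  // Property escapes require the "u" flag.
  const projectPattern = new RegExp("\\+[\\p{L}\\p{N}_]+", "gu");
  return line.match(projectPattern) || [];
}

// Cyrillic names are matched in full, while punctuation ends the token.
console.log(findProjects("Купить молоко +покупки, +pro,ject"));
// ["+покупки", "+pro"]
```

The trade-off is that any symbol outside the assumed set, such as a hyphen or a dot, would cut a name short, so the set would need to be agreed on before adopting something like this.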