blacksmithgu / obsidian-dataview

A data index and query language over Markdown files, for https://obsidian.md/.
https://blacksmithgu.github.io/obsidian-dataview/
MIT License

Add unicode flag to regex of 'regexmatch()' function (and all other functions which allow regex) #1656

Open cschaba opened 1 year ago

cschaba commented 1 year ago

**Is your feature request related to a problem? Please describe.**

I am German and have some notes whose filenames contain German umlauts (e.g. Ü).

I want to generate a glossary page in Obsidian using Dataview.

Currently I am using a query like this:

```dataview
list glossary
where regexmatch("^[A-Z][A-Z][A-Z]{0,4}$", file.name)
```

This query works fine if the filenames contain only ASCII characters. But I have files whose names contain characters such as the German umlaut `Ü`.

I did some research and found a page describing how regex and Unicode work together. I also created a test page on regex101 to verify a possible solution; see the links at the end.

**Describe the solution you'd like**

For this I want to use the Unicode regex `\p{Lu}`, but it seems Obsidian does not interpret it, as I get no output. Using `[A-Z]` works, but of course it does not match `Ü`.

I would like to use this query:
```dataview
list glossary
where regexmatch("^\p{Lu}\p{Lu}{0,4}.*$", file.name)
```
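
For reference, a minimal TypeScript sketch of how this pattern behaves in a JavaScript engine with and without the `u` flag. It assumes nothing about Dataview's internals: `matchesGlossaryName` is a hypothetical helper, and the pattern is built from a string with `new RegExp`, as a query function would have to do.

```typescript
// Hypothetical helper: tests a file name against the pattern from the query above.
// `flags` is passed straight to the RegExp constructor.
function matchesGlossaryName(fileName: string, flags: string): boolean {
    const pattern = "^\\p{Lu}\\p{Lu}{0,4}.*$";
    return new RegExp(pattern, flags).test(fileName);
}

// Without the `u` flag, `\p` is just an escaped "p", so the pattern looks for
// the literal text "p{Lu}..." and an upper-case file name does not match.
console.log(matchesGlossaryName("TÜBA", ""));

// With the `u` flag, `\p{Lu}` means "any upper-case letter", including Ü.
console.log(matchesGlossaryName("TÜBA", "u"));
```

The first call is expected to print `false` and the second `true`, which would explain why the query returns no rows when the flag is missing.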

Testfile named `TÜBA.md`:

glossary: Türkiye Bilimler Akademisi (Turkish Academy of Sciences)

Test with umlaut



**Describe alternatives you've considered**

As a workaround I use `[A-Z]` in the regex and rename the files, replacing e.g. `Ü` with `UE`, which is common practice, at least in Germany.

But this is a bit annoying. I think other users with international characters would also benefit from support for Unicode patterns.

**Additional context**

- About Unicode in Regex see the descriptions here: https://www.regular-expressions.info/unicode.html
- I also set up a test at https://regex101.com/r/75AuuL/1 (the line with "TÜBA" should match)

Thanks very much for this great plugin! :)

s-blu commented 1 year ago

Hello,

I do not quite understand what output you want here. This is how regex behaves; it is nothing Dataview does on its own. If I read your shared link correctly, you are using the PHP flavor of regex, not the JavaScript one. Obsidian runs on JS, so regular expressions are evaluated with the JS flavor. If I switch to the JS flavor, your regex no longer seems to be valid. Support for the unicode flag seems to be mixed, so I am not sure whether the engine Obsidian runs on is capable of the Unicode syntax.

But anyway, you should be able to declare `[A-ZÄÖÜ]` as your character set if umlauts are your only concern. Check whether `regexmatch("^[A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ]{0,4}$", file.name)` does the job. By the way, you should be able to shorten your regex to:

`regexmatch("^[A-ZÄÖÜ]{2,6}$", file.name)`

Would that fulfill your requirement?
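
As a quick check of the shortening suggested above, the following TypeScript sketch runs both patterns against a few made-up names. The sample names are assumptions for illustration only; the two patterns are copied from this comment.

```typescript
// The two patterns from the comment above. Neither needs a special flag,
// because Ä, Ö and Ü (in precomposed form) are single UTF-16 code units.
const longForm = new RegExp("^[A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ]{0,4}$");
const shortForm = new RegExp("^[A-ZÄÖÜ]{2,6}$");

// Sample names are made up for illustration.
const sampleNames = ["TÜBA", "A", "AB", "ABCDEF", "ABCDEFG", "Tüba"];

for (const sample of sampleNames) {
    // Both columns should agree for every name.
    console.log(sample, longForm.test(sample), shortForm.test(sample));
}
```

Both forms accept two to six characters from the class and nothing else, so they should agree on every input.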

cschaba commented 1 year ago

Hi s-blu,

My point is that the Unicode regex `\p{Lu}` does not seem to work.

Yes, listing the umlauts as you suggested would be a solution for me, but the clean way would be to use the Unicode regex to allow any Unicode character, e.g. Chinese, Cyrillic or whatever. That was my point :)
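
To illustrate the difference, here is a small TypeScript sketch comparing the explicit character class with Unicode property escapes on names from other scripts. The sample strings are assumptions chosen for illustration. Note that scripts without letter case, such as Chinese, fall under `\p{L}` rather than `\p{Lu}`.

```typescript
// Patterns are built from strings so the `u` flag can be supplied explicitly.
const explicitClass = new RegExp("^[A-ZÄÖÜ]{2,6}$");
const upperCaseLetters = new RegExp("^\\p{Lu}{2,6}$", "u");
const anyLetters = new RegExp("^\\p{L}{2,6}$", "u");

// Illustrative samples, written as escapes to avoid look-alike characters.
const cyrillicSample = "\u041D\u0410\u0422\u041E"; // four Cyrillic capital letters
const hanSample = "\u5317\u4EAC";                   // two Han characters

console.log(explicitClass.test(cyrillicSample));    // false: not in the listed class
console.log(upperCaseLetters.test(cyrillicSample)); // true: Cyrillic capitals are Lu
console.log(upperCaseLetters.test(hanSample));      // false: Han characters have no case
console.log(anyLetters.test(hanSample));            // true: they are still letters
```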

About Unicode in Regex see the descriptions here: https://www.regular-expressions.info/unicode.html

Thanks for the reply.

holroy commented 1 year ago

I second the request to support Unicode property escapes, so that one can match non-English but still legal word characters (and a whole bunch of other things) through regex within Dataview.

For more information on Unicode property escapes, see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Unicode_Property_Escapes, and the list of links at the bottom of that page. This feature request mainly targets non-English writers who need to run regex against their local language and its special characters.

xDovos commented 1 year ago

I found another use case where I need Unicode support: writing filename comparisons for filenames that are full of emoji. The problematic line is this one, where the flag is just "g" and not "gu": https://github.com/blacksmithgu/obsidian-dataview/blob/b243f8ce78e08998479ae9d72a0cd11b21fd1c1a/src/expression/functions.ts#L527
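
A rough TypeScript sketch of the difference between the two flag sets when stripping emoji before comparing names. `stripEmoji` is a hypothetical helper, not the code at the linked line, and it assumes a JavaScript engine that supports the `Extended_Pictographic` property.

```typescript
// Hypothetical helper: removes every emoji from a file name.
// `flags` is passed straight to the RegExp constructor ("g" or "gu").
function stripEmoji(fileName: string, flags: string): string {
    const emoji = new RegExp("\\p{Extended_Pictographic}", flags);
    return fileName.replace(emoji, "");
}

// With "g" only, `\p{...}` is read as literal text, so nothing is removed.
console.log(stripEmoji("👤Anne", "g"));

// With "gu", the property escape is honoured and the emoji is removed.
console.log(stripEmoji("👤Anne", "gu"));

// Names that differ only in their emoji now compare as equal.
console.log(stripEmoji("👤Anne", "gu") === stripEmoji("📅Anne", "gu"));
```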

fzadow commented 9 months ago

Here is my use case; maybe it helps to clarify the issue:

In a Dataview query I use:

TABLE WITHOUT ID
    regexreplace(Tasks.text, "[📅⏳🛫⏫🔼🔽].*$", "") AS "Task",
...

The idea is to remove everything from the task name after (and including) one of the 6 Unicode characters in the expression. While this works well, it also matches other characters such as 👤. A task such as "- [ ] Ask [[👤Anne]] for advice" is rendered as "Ask [[".

**Proposed solution**

regexreplace() calls String.prototype.replace() without the u flag. Adding the flag in https://github.com/blacksmithgu/obsidian-dataview/blob/b87858020ae1b7d87d14a7f9f847bcd8cc5a7438/src/expression/functions.ts#L578 would solve the problem (I'm not saying it won't create a couple of new ones along the way, so maybe make that flag optional :) ).
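
The behaviour described above can be reproduced outside Obsidian. The TypeScript sketch below is a hypothetical stand-in for a `regexreplace`-style function with an opt-in unicode flag. The name `regexReplaceSketch` and its `unicode` parameter are assumptions for illustration, not Dataview's implementation.

```typescript
// Hypothetical stand-in for a regexreplace-style function with an opt-in flag.
// This is not Dataview's implementation.
function regexReplaceSketch(text: string, pattern: string, replacement: string, unicode = false): string {
    const flags = unicode ? "gu" : "g";
    return text.replace(new RegExp(pattern, flags), replacement);
}

const taskMarkers = "[📅⏳🛫⏫🔼🔽].*$";
const task = "Ask [[👤Anne]] for advice";

// Without `u`, the class holds UTF-16 code units. 👤 and 📅 share the same
// high surrogate (\uD83D), so the match starts in the middle of 👤.
console.log(regexReplaceSketch(task, taskMarkers, ""));       // "Ask [["

// With `u`, the class holds whole code points and 👤 is not among them.
console.log(regexReplaceSketch(task, taskMarkers, "", true)); // unchanged

// A real marker is still matched and everything after it is removed.
console.log(regexReplaceSketch("Write report 📅 2024-05-01", taskMarkers, "", true)); // "Write report "
```

Keeping the flag opt-in, as in this sketch, would leave existing queries untouched, which is one way to address the concern about new problems.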