Closed chriskempson closed 5 years ago
It should work. The default expression already contains a lot of Unicode symbols such as long dashes, typographic quotation marks and so on, so the problem is probably elsewhere. I suppose your texts are encoded in UTF-8?
Oh okay, thank you. And yes, the texts were UTF-8 encoded. I couldn't figure out exactly what the issue was, so I ended up using a sentence splitter to split the texts by line breaks before importing them.
Can you paste here the whole regular expression you have used? Did you keep the replacement pattern intact?
Sure 😄
Find: (。|?|!|\.|\?|\!|…\s?|:)\s*((?:<[^>]*>|”|“|‛|’|‘|‟|′|″|‵|‶|»|\"|\'|\]|\)|\s)*)(\s+)((?:<[^\/][^>]*>|«|\"|„|”|“|‛|‚|’|‘|‟|′|″|‵|‶|\'|\(|\[|\—|-|—|\s)*)(\p{Lu})
Replace: \1\2#!#\4\5
Right, so you only changed the first group. I cannot think of any reason either... :-/
But the problem may be elsewhere. Do the sentences really start with letters classified as "uppercase letters" in Unicode? (That is what the last part, "\p{Lu}", requires.) I guess there are actually no uppercase letters in Japanese, are there?
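To make the "\p{Lu}" point concrete, here is a small Python sketch that looks up the Unicode general category of a few characters. It uses the standard `unicodedata` module rather than the tool's own regex engine, and the helper name `is_uppercase_letter` is made up for illustration.

```python
import unicodedata

# \p{Lu} matches characters whose Unicode general category is "Lu"
# (Letter, uppercase). This hypothetical helper checks that category.
def is_uppercase_letter(ch: str) -> bool:
    return unicodedata.category(ch) == "Lu"

# Latin letters plus one kanji, one hiragana and one katakana character.
samples = ["A", "É", "a", "日", "あ", "ア"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(ch, cat, is_uppercase_letter(ch))
```

Kanji and kana fall under "Lo" (Letter, other), so a pattern that ends in "\p{Lu}" cannot match at the start of a sentence written in Japanese script.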
Oh yes, I didn't spot that sorry! That's true, there are no uppercase letters in Japanese.
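For illustration only, the following Python sketch mimics the find/replace idea with a much simpler pattern than the default expression. It assumes one possible adjustment, accepting any letter instead of only uppercase ones; the thread does not say which change was actually made. Python's built-in `re` module has no "\p{...}" classes, so `[^\W\d_]` stands in for "any letter", and the sample text is invented.

```python
import re

# Simplified stand-in for the default rule, not the tool's real expression:
#   group 1: sentence-final punctuation
#   group 2: the whitespace between sentences (dropped by the replacement)
#   group 3: first letter of the next sentence; [^\W\d_] accepts any letter,
#            where the original rule required an uppercase one
pattern = re.compile(r"([.?!。])(\s+)([^\W\d_])")

text = "It works. 日本語の文。 次の文です。"

# Insert the #!# marker in place of the whitespace, as the Replace rule does.
marked = pattern.sub(r"\1#!#\3", text)
sentences = marked.split("#!#")

print(marked)
print(sentences)
```

With the last group restricted to uppercase letters, neither boundary in this sample would match, because both following sentences start with kanji.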
Just tried this again and it works perfectly now. My apologies for reporting this non-issue, and thanks again for all your help 😃
I'm trying to split Japanese text on 。, ? and ! by adding 。|?|!| to the first replacement rule in the default profile, but this doesn't seem to work. I was wondering whether the regex parser is unable to process Unicode?