Splitting Japanese Text

czcorpus / InterText_editor

Editor for aligned parallel texts (personal desktop application).

http://wanthalf.saga.cz/intertext


Splitting Japanese Text #2

Closed chriskempson closed 5 years ago

chriskempson commented 5 years ago

I'm trying to split Japanese text on 。, ？ and ！ by adding 。|？|！| to the first replacement rule in the default profile, but this doesn't seem to work. I was wondering whether the regex parser is unable to process Unicode?

wanthalf commented 5 years ago

It should work. The default expression already contains a lot of Unicode symbols such as long dashes, typographic quotation marks and so on, so the problem is probably elsewhere. I suppose your texts are encoded in UTF-8?

chriskempson commented 5 years ago

Oh okay, thank you. And yes, the texts were UTF-8 encoded. I couldn't figure out exactly what the issue was, so I ended up using a sentence splitter to split the texts by line breaks before importing them.

wanthalf commented 5 years ago

Can you paste the whole regular expression you used here? Did you keep the replacement pattern intact?

chriskempson commented 5 years ago

Sure 😄

Find: (。|？|！|\.|\?|\!|…\s?|:)\s*((?:<[^>]*>|”|“|‛|’|‘|‟|′|″|‵|‶|»|\"|\'|\]|\)|\s)*)(\s+)((?:<[^\/][^>]*>|«|\"|„|”|“|‛|‚|’|‘|‟|′|″|‵|‶|\'|\(|\[|\—|-|—|\s)*)(\p{Lu})

Replace: \1\2#!#\4\5

wanthalf commented 5 years ago

Right, so you only changed the first group. I cannot think of any reason either... :-/

But the problem may be elsewhere. Do the sentences really start with letters classified as "uppercase letters" in Unicode? (That is what the last part, "\p{Lu}", requires.) I guess there are actually no "uppercase" letters in Japanese, are there?
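
The point about "\p{Lu}" can be checked outside InterText. The following is a minimal Python sketch, not part of InterText itself; the helper name starts_with_uppercase is hypothetical. It uses the standard unicodedata module to look up the Unicode general category of the first character of a few sample strings. Kanji, hiragana and katakana are all reported as "Lo" (other letter) rather than "Lu" (uppercase letter), so a pattern that ends in \p{Lu} cannot match at the start of a Japanese sentence.

```python
import unicodedata


def starts_with_uppercase(sentence: str) -> bool:
    """Return True if the first character has Unicode general category Lu.

    Hypothetical helper for illustration; it mirrors what a trailing
    \\p{Lu} in a regular expression would require of the next sentence.
    """
    return bool(sentence) and unicodedata.category(sentence[0]) == "Lu"


# One sample per script: Latin, kanji, hiragana, katakana.
for sample in ["Hello", "日本語", "ありがとう", "カタカナ"]:
    first = sample[0]
    print(first, unicodedata.category(first), starts_with_uppercase(sample))
```

Only the Latin sample starts with a character in category Lu; the three Japanese samples do not.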

chriskempson commented 5 years ago

Oh yes, I didn't spot that sorry! That's true, there are no uppercase letters in Japanese.

Just tried this again and it works perfectly now. My apologies for reporting this non-issue, and thanks again for all your help 😃
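
The thread does not record the final expression that worked. As a rough illustration of the idea only, the Python sketch below marks sentence boundaries after 。, ？ and ！ without requiring an uppercase letter afterwards. It is a deliberately simplified assumption, not InterText's default rule: the pattern and the names MARKER, BOUNDARY, mark_boundaries and split_sentences are hypothetical, and only the "#!#" marker is taken from the replacement pattern quoted above.

```python
import re

# Boundary marker, taken from the "Replace" pattern quoted in the thread.
MARKER = "#!#"

# Simplified, assumed pattern: a run of Japanese sentence-final punctuation,
# optional whitespace, and then any character that is neither whitespace nor
# more sentence-final punctuation. No uppercase letter is required.
BOUNDARY = re.compile(r"([。？！]+)\s*(?=[^\s。？！])")


def mark_boundaries(text: str) -> str:
    """Insert MARKER after each sentence-final punctuation run."""
    return BOUNDARY.sub(r"\1" + MARKER, text)


def split_sentences(text: str) -> list:
    """Split text into sentences at the inserted markers."""
    return mark_boundaries(text).split(MARKER)


if __name__ == "__main__":
    for sentence in split_sentences("今日は晴れです。明日は雨ですか？はい！"):
        print(sentence)
```

Unlike the full default expression, this sketch ignores quotation marks, brackets and markup tags around the boundary, so it would need extending before being used on real texts.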