HugoFara / lwt

Learn languages by reading! A language learning app stemmed from Learning with Texts (LWT).
https://hugofara.github.io/lwt/
The Unlicense

The RegExp may have bug to separate the words #91

Closed 99MengXin closed 1 year ago

99MengXin commented 1 year ago

Describe the bug: The RegExp may have a bug when separating words.

To Reproduce Steps to reproduce the behavior:

  1. Go to Languages
  2. Click on Edit (In this case, English)
  3. Scroll down to RegExp Split Sentences, and set .!?:;
  4. Scroll down to RegExp Word Characters, and set \-\'a-zA-ZÀ-ÖØ-öø-ȳЀ-ӹ
  5. Click on Save
  6. Choose any English text and read
  7. See error

Expected behavior: e.g. 'I should like it.' The word "it" in this sentence should be recognized as it, but it is recognized as it. (with the trailing period). Other symbols, e.g. ?!;:, are recognized correctly.

Screenshots

Screenshot 2023-02-16 08 24 20

Additional context: This function works well in the original LWT.

HugoFara commented 1 year ago

Hi! I can give you a few suggestions. In step 4 you wrote:

Scroll down to RegExp Word Characters, and set \-\'a-zA-ZÀ-ÖØ-öø-ȳЀ-ӹ

Based on the provided RegExp, ' is treated as a word character, which is useful in some contexts (English genitive such as "Tom's house" → "Tom's + house"). If that causes trouble for you, you can simply remove the apostrophe from the word characters (use \-a-zA-ZÀ-ÖØ-öø-ȳЀ-ӹ).

In your context, I would suggest using " instead of ' whenever possible.
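
To make the two suggestions above concrete, here is a minimal Python sketch of how the word-character class changes what counts as a word. It is only an illustration: it is not LWT's actual parser (which may behave differently), and split_words is a hypothetical helper that treats every maximal run of word characters as one word.

```python
import re

# Word-character class from step 4, and the suggested variant without the apostrophe.
WITH_APOSTROPHE = r"\-\'a-zA-ZÀ-ÖØ-öø-ȳЀ-ӹ"
WITHOUT_APOSTROPHE = r"\-a-zA-ZÀ-ÖØ-öø-ȳЀ-ӹ"


def split_words(text, word_chars):
    """Hypothetical helper: return each maximal run of word characters.

    This only mimics the idea of a word-character class; it is an
    assumption for illustration, not how LWT itself parses texts.
    """
    return re.findall(f"[{word_chars}]+", text)


# With ' in the class, the quote marks become part of (or whole) words.
print(split_words("'I should like it.'", WITH_APOSTROPHE))
# ["'I", 'should', 'like', 'it', "'"]

# Without ' in the class, the quote marks are ignored...
print(split_words("'I should like it.'", WITHOUT_APOSTROPHE))
# ['I', 'should', 'like', 'it']

# ...but the genitive is split apart.
print(split_words("Tom's house", WITHOUT_APOSTROPHE))
# ['Tom', 's', 'house']

# Quoting with " keeps both the genitive and clean words.
print(split_words('"I like Tom\'s house."', WITH_APOSTROPHE))
# ['I', 'like', "Tom's", 'house']
```

The last call shows why quoting with " is the gentler fix: the apostrophe can stay in the class for genitives while quoted sentences still split cleanly.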

Finally, I can look into the word parser if I have some time, but it is a core function, so it's really hard to change without breaking something for users.

I hope that helps you :smile: