UlionTse / translators

🌏🌍🌎Translators🌎🌍🌏 is a library that aims to bring free, multiple, enjoyable translations to individuals and students in Python.
https://pypi.org/project/translators/
GNU General Public License v3.0
1.62k stars 189 forks

[Bug]: Python 3 doesn't like re with Positive Lookbehind ? #145

Closed Cabu closed 11 months ago

Cabu commented 11 months ago


What happened?

Given this code:

import re
html_text = '<p>sentence 1</p><p>sentence 2</p>'
pattern = re.compile("(?:^|(?<=>))([\\s\\S]*?)(?:(?=<)|$)")  # TODO: <code></code> <div class="codetext notranslate">
sentence_list = list(set(pattern.findall(html_text)))

In Python 2.7.18, the re module returns:

sentence_list = ['sentence 1', 'sentence 2']

In Python 3.7.16 and 3.11.3, the re module returns:

sentence_list = ['<p>sentence 1', '<p>sentence 2']

Simplifying the RE seems to work well:

pattern = re.compile("(?:^|>)([\\s\\S]*?)(?:<|$)")  # TODO: <code></code> <div class="codetext notranslate">

I have opened a bug in CPython in case the problem comes from there: https://github.com/python/cpython/issues/109579
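
A minimal sketch comparing the two patterns from this report side by side. The helper name extract is hypothetical (it is not part of translators); it only drops the empty strings that findall also returns and removes duplicates while keeping order. A likely explanation for the difference, stated here as an assumption rather than a confirmed diagnosis: the re documentation notes that since Python 3.7 a non-empty match can start just after a previous empty match, so after the empty match at position 0 the lazy group can grow from the same position and pick up the opening tag.

```python
import re

html_text = '<p>sentence 1</p><p>sentence 2</p>'

# Original pattern from the report: delimiters are zero-width lookarounds.
lookaround = re.compile(r"(?:^|(?<=>))([\s\S]*?)(?:(?=<)|$)")
# Simplified pattern from the report: the '>' and '<' delimiters are consumed.
consuming = re.compile(r"(?:^|>)([\s\S]*?)(?:<|$)")


def extract(pattern, text):
    """Hypothetical helper: findall without empty strings, de-duplicated in order."""
    found = []
    for piece in pattern.findall(text):
        if piece and piece not in found:
            found.append(piece)
    return found


lookaround_result = extract(lookaround, html_text)
consuming_result = extract(consuming, html_text)
print(lookaround_result)
print(consuming_result)
```

On Python 3.7 or later the first list should contain the tag-prefixed strings described above, while the second should contain only the sentence text.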

APP Version

5.8.3

Python Version

3.11

Runtime Environment

Linux CentOS (Default)

Country/Region

Belgium

Relevant log output

No response

Screenshots

No response


UlionTse commented 11 months ago

@Cabu Friend, I think the non-greedy pattern >([\\s\\S]*?)< is enough, even if it's not perfect. Please try the new version 5.8.4. Thanks for finding the difference between py2 and py3.

UlionTse commented 11 months ago

@Cabu The pattern is used twice, with both findall and sub, not only findall. That is why it was written in such a complicated way long ago.
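
A rough sketch of why sub makes the choice of pattern matter, under stated assumptions. fake_translate is a hypothetical stand-in for the translation step, and the zero_width pattern is one possible variant written for this illustration, not the pattern shipped in translators. It keeps the lookarounds but requires at least one non-tag character, so no empty match can occur.

```python
import re

html_text = '<p>sentence 1</p><p>sentence 2</p>'


def fake_translate(match):
    """Hypothetical stand-in for a translation call: upper-case the captured text."""
    return match.group(1).upper()


# Simplified pattern from the report: '>' and '<' are part of each match,
# so sub replaces them along with the text in between.
consuming = re.compile(r"(?:^|>)([\s\S]*?)(?:<|$)")
# Assumed variant: zero-width delimiters plus a non-empty body ([^<]+).
zero_width = re.compile(r"(?:^|(?<=>))([^<]+)(?:(?=<)|$)")

consumed = consuming.sub(fake_translate, html_text)
preserved = zero_width.sub(fake_translate, html_text)
print(consumed)
print(preserved)
print(zero_width.findall(html_text))
```

With the consuming pattern the angle brackets are lost in the substituted text, which is presumably the reason the delimiters were written as lookarounds. The non-empty variant keeps the markup intact under sub and still returns only the sentence text from findall.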