Consider saving generated modules in utf-8

enthought / comtypes

A pure Python, lightweight COM client and server framework, based on the ctypes Python FFI package.


Consider saving generated modules in utf-8 #592

Closed LeonarddeR closed 4 weeks ago

LeonarddeR commented 3 months ago

Wrapper modules are currently saved without specifying an encoding, so Python falls back to a platform-specific one. This means the encoding the modules are saved in can differ from system to system. For example, on my Windows 11 system, when I have the "Beta: Use Unicode UTF-8 for worldwide language support" option enabled, locale.getencoding() returns cp65001. When I have that option disabled, the encoding is cp1252.

Therefore, I propose always saving the generated modules with UTF-8 encoding.
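To make the difference concrete, here is a minimal sketch of the two ways of writing a file. It is not comtypes code: write_module, SOURCE, and the file names are hypothetical, and the sample text only stands in for generated wrapper content. With encoding=None, open() defers to the platform default; with encoding="utf-8", the bytes on disk are the same on every system.

```python
import locale
import tempfile
from pathlib import Path
from typing import Optional


def write_module(path: Path, source: str, encoding: Optional[str] = None) -> None:
    """Write source text; encoding=None defers to the platform default."""
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding=encoding, newline="\n") as f:
        f.write(source)


# Hypothetical generated content with one non-ASCII character in a docstring.
SOURCE = 'class IExample:\n    """Beispiel für eine Schnittstelle."""\n'

with tempfile.TemporaryDirectory() as tmp:
    implicit = Path(tmp) / "implicit.py"
    explicit = Path(tmp) / "explicit.py"

    write_module(implicit, SOURCE)                    # platform default
    write_module(explicit, SOURCE, encoding="utf-8")  # always UTF-8

    implicit_bytes = implicit.read_bytes()
    explicit_bytes = explicit.read_bytes()

print("platform default:", locale.getpreferredencoding(False))
print("same bytes on this system:", implicit_bytes == explicit_bytes)
```

On a system whose default is already UTF-8 the two files are identical. On a cp1252 system the "ü" is stored as a single byte in the first file and as two bytes in the second.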

junkmd commented 3 months ago

I believe that making the encoding of generated files always UTF-8 requires very careful discussion, as it may result in breaking backward compatibility.

comtypes generates Python module files based on the information from the COM type libraries in the executed environment. Some COM type libraries may contain characters that cannot be recognized with UTF-8. Including such characters in UTF-8 encoded .py files might cause some problems.

In someone's environment, not explicitly setting the encoding and leaving it system-dependent might be a condition for the generated module to work correctly.

I mainly use a Japanese environment and often encounter encoding-related problems. Because of that, I believe that any changes to encoding settings should be made cautiously.

What do you think?

LeonarddeR commented 3 months ago

Isn't UTF-8 supposed to be able to encode every single character? I'd be curious to know any examples that might break.
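One way to probe this question is to check which sample strings a given codec can encode. The sketch below is only an illustration: can_encode and the sample strings are hypothetical and are not taken from any real type library. It shows that UTF-8 encodes ordinary text that legacy code pages reject, and that a Python str containing a lone surrogate is the one case UTF-8 itself refuses. Whether type library strings can ever reach the generator in that form is not something this thread establishes.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample strings; "\udc80" is a lone (unpaired) surrogate.
samples = {
    "ascii": "IUnknown",
    "japanese": "日本語",
    "lone surrogate": "\udc80",
}

for label, text in samples.items():
    results = {enc: can_encode(text, enc) for enc in ("utf-8", "cp1252", "cp932")}
    print(label, results)
```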

junkmd commented 3 months ago

This depends precisely on the execution environment, which makes it difficult to create a reproducer. What I am considering is the possibility of environment-dependent special characters within the COM type library.

BTW, I have looked at the NVDA PR mentioned in this issue. Is it not possible to resolve the issue by changing the encoding of source/comInterfaces/_944DE083_8FB8_45CF_BCB7_C477ACB2F897_0_1_0.py to UTF-8?

LeonarddeR commented 3 months ago

> BTW, I have looked at the NVDA PR mentioned in this issue. Is it not possible to resolve the issue by changing the encoding of source/comInterfaces/_944DE083_8FB8_45CF_BCB7_C477ACB2F897_0_1_0.py to UTF-8?

Yes, I think that can be a fix for that particular issue.

junkmd commented 3 months ago

I think that encoding issues with wrapper modules only arise when the wrapper module is managed inside the project itself, as with NVDA. In such cases, as you mentioned, it only fixes that particular case, but I think it would be sufficient to change the encoding of the wrapper module and save it in the repository.
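For a wrapper module that is checked into a repository, the re-encoding step could look roughly like the sketch below. reencode_to_utf8 is a hypothetical helper, not part of comtypes or NVDA, and the source encoding (cp1252 here) is an assumption the caller has to supply, because the file itself does not record which encoding it was written in.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite a file in place as UTF-8, given its current encoding."""
    raw = path.read_bytes()
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Work on bytes so that line endings are left untouched.
    path.write_bytes(text.encode("utf-8"))


# Usage on a throwaway file that stands in for a checked-in wrapper module.
with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "wrapper.py"
    module.write_bytes('NAME = "für"\r\n'.encode("cp1252"))

    reencode_to_utf8(module, "cp1252")
    converted = module.read_bytes()

print(converted.decode("utf-8"))
```

If the file carries an explicit coding declaration on its first lines, that declaration would also need to be updated or removed by hand, since this sketch only changes the bytes.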

Again, I am more concerned about the possibility that something that was working well in a user's environment might stop working than about the convenience that fixing the encoding might bring. I fear reports of regressions caused by encoding problems if we release without adding tests.

At the moment, I can't think of a way to guarantee that specifying the encoding won't cause problems. We might need a COM-type library with special characters or an execution environment other than the English environment provided by GHA.

Please forgive me for being sensitive about encoding issues. In Japanese environments, troubles related to file encoding are really common.

junkmd commented 1 month ago

@LeonarddeR

Are you still interested in this issue?

LeonarddeR commented 4 weeks ago

I think it's clear to me that this is currently not feasible. We can live with that.

junkmd commented 4 weeks ago

Thank you for your reply.

I believe that in order to make this technically feasible, a deep understanding of COM type libraries, the character encodings they use, and their interaction with the default encoding of the runtime environment would be required, along with sufficient test cases to demonstrate that there are no issues.

I will close this issue, but if anyone has any proposals that could address the matters mentioned above, I would be happy to review them.