Closed LeonarddeR closed 4 weeks ago
I believe that making the encoding of generated files always UTF-8 requires very careful discussion, as it may result in breaking backward compatibility.
comtypes generates Python module files based on the information from the COM type libraries in the execution environment.
Some COM type libraries may contain characters that cannot be represented in UTF-8.
Including such characters in UTF-8 encoded .py files might cause some problems.
In some users' environments, leaving the encoding unset, and therefore system-dependent, might be a condition for the generated module to work correctly.
I mainly use a Japanese environment and often encounter encoding-related problems. Because of that, I believe that any changes to encoding settings should be made cautiously.
What do you think?
Isn't UTF-8 supposed to be able to encode every single character? I'd be curious to know any examples that might break.
This is precisely dependent on the execution environment, making it difficult to create a reproducer. And what I am considering is the possibility of environment-dependent special characters within the COM type library.
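As a partial answer to the question of what might break, here is a minimal sketch at the Python level only. It assumes nothing about what a real COM type library contains; the sample strings and the helper name can_encode are hypothetical. It shows that UTF-8 encodes ordinary Unicode text, that a legacy code page such as cp1252 does not, and that an unpaired surrogate (which a UTF-16 string can in principle carry) is rejected by Python's strict UTF-8 codec.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` in strict mode."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample strings, not taken from any real type library.
japanese = "\u65e5\u672c\u8a9e"   # three CJK characters
emoji = "\U0001F600"              # a character outside the Basic Multilingual Plane
lone_surrogate = "\ud800"         # an unpaired UTF-16 surrogate

for label, sample in [("japanese", japanese), ("emoji", emoji), ("lone surrogate", lone_surrogate)]:
    print(label, "utf-8:", can_encode(sample, "utf-8"), "cp1252:", can_encode(sample, "cp1252"))
```

Whether any type library actually contains such a string is exactly the open question in this thread; the sketch only narrows down which inputs Python itself would refuse.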
BTW, I have looked at the NVDA PR mentioned in this issue.
Is it not possible to resolve the issue by changing the encoding of source/comInterfaces/_944DE083_8FB8_45CF_BCB7_C477ACB2F897_0_1_0.py to UTF-8?
Yes, I think that can be a fix for that particular issue.
I think that encoding issues with wrapper modules only arise when managing the wrapper module internally within the project, as with NVDA. In such cases, as you mentioned, it is indeed a particular solution, but I think it would be sufficient to change the encoding of the wrapper module and save it in the repository.
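For a project that keeps a wrapper module in its repository, the re-encoding step suggested above could look roughly like the following sketch. The function name reencode_to_utf8 and the demo file are hypothetical, and the source encoding has to be known by whoever runs it; nothing here is part of comtypes.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite `path` in place as UTF-8, assuming it is currently in `source_encoding`."""
    raw = path.read_bytes()
    # A wrong guess for source_encoding may raise here, or may silently
    # produce wrong text, so the caller must know the real encoding.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


# Demo on a throwaway file that stands in for a wrapper module saved as cp1252.
with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "wrapper_demo.py"
    module_path.write_bytes('NAME = "caf\u00e9"\n'.encode("cp1252"))
    reencode_to_utf8(module_path, "cp1252")
    converted = module_path.read_bytes()

print(converted.decode("utf-8"))
```

Working on bytes rather than text-mode files keeps line endings untouched, so the only change committed to the repository is the encoding itself.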
Again, I am more concerned about the possibility that something that was working well in a user's environment might stop working, rather than the convenience that might be brought about by fixing the encoding. I fear the reports of regressions caused by encoding problems if we release without adding tests.
At the moment, I can't think of a way to guarantee that specifying the encoding won't cause problems. We might need a COM type library with special characters or an execution environment other than the English environment provided by GHA.
Please forgive me for being sensitive about encoding issues. In Japanese environments, troubles related to file encoding are really common.
@LeonarddeR
Are you still interested in this issue?
I think it's clear to me that this is currently not feasible; we can live with that.
Thank you for your reply.
I believe that in order to make this technically feasible, a deep understanding of COM type libraries, the character encodings they use, and their interaction with the default encoding of the runtime environment would be required, along with sufficient test cases to demonstrate that there are no issues.
I will close this issue, but if anyone has any proposals that could address the matters mentioned above, I would be happy to review them.
Wrapper modules are currently saved without specifying an encoding, so a platform-specific encoding is chosen. This means that the encoding the modules are saved in can differ per system. For example, on my Windows 11 system, when I have the "Beta: Use Unicode UTF-8 for worldwide language support" option enabled, locale.getencoding() returns cp65001. When I have that option disabled, the encoding will be cp1252.
Therefore, I propose always saving the generated modules with UTF-8 encoding.
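To illustrate why the per-system difference matters, here is a small sketch, not taken from comtypes itself. The sample line is hypothetical, and locale.getpreferredencoding(False) is used in place of locale.getencoding() only because the latter requires Python 3.11 or newer. The same text produces different bytes under cp1252 and UTF-8, and the cp1252 bytes cannot be read back as UTF-8, which is the encoding Python assumes for source files that carry no coding declaration.

```python
import locale

# Hypothetical line from a generated wrapper module with one non-ASCII character.
text = 'DESCRIPTION = "caf\u00e9"\n'

cp1252_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print("preferred encoding on this system:", locale.getpreferredencoding(False))
print("same bytes under both encodings:", cp1252_bytes == utf8_bytes)
print("cp1252 bytes readable as UTF-8:", decodes_as_utf8(cp1252_bytes))
print("UTF-8 bytes readable as UTF-8:", decodes_as_utf8(utf8_bytes))
```

Passing encoding="utf-8" to open() when writing the module would remove this dependence on the system setting, which is the change proposed here.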