DanBloomberg / leptonica

Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications. The official github repository for Leptonica is: danbloomberg/leptonica. See leptonica.org for more documentation.
Other
1.74k stars 387 forks source link

always use the UTF8/ANSI Win32 APIs (MSVC, Unicode builds) #681

Closed GerHobbelt closed 1 year ago

GerHobbelt commented 1 year ago

always use the UTF8/ANSI Win32 APIs, whether the code is built in Ansi/MBCS or Unicode (16-bit TCHAR) mode (MSVC: /Unicode flag)


These show up when compiling the code in /Unicode mode, which is our in-house setting for all projects. Don't know if this 'reachable' via the provided cmake, as I don't use that one for Windows.

DanBloomberg commented 1 year ago

UTF8 is not supported anywhere in leptonica.

Leptonica was initially developed before UTF8 usage became common. I can see that the changes in this PR are minimal, but it's what I call "the camel's nose under the tent." To support it at this time would be going down a very deep and expensive rabbit hole, and there is no reason to do it. Any more than supporting multiple word filenames for I/O, for example.

Surely your in-house projects can get around this restriction.

GerHobbelt commented 1 year ago

but it's what I call "the camel's nose under the tent."

😄

Surely your in-house projects can get around this restriction.

Well, the story here is that I've been burned pretty badly over the years by code/libraries/DLLs which (it then would turn out) are compiled with slightly different microsoft compiler settings -- losses measurable in both multiple man-weeks and receding/graying hair -- so I now make sure all projects I have control over use the exact same compiler settings (and compiler) for the entire application, always. It costs, but the horribly weird and damn hard to track down failures are still considered worse in my book: it drives you nuts.

Against that background, leptonica is built with the same settings as a bunch of other libs, and some of those just happen to need Microsoft cl.exe's /Unicode flag. Which has, among its side-effects, the effect of automagickally aliasing all string-accepting Win32 API calls with their 'W' (wide) versions, while non-'unicode' builds default to legacy 'A' (Ansi/MBCS) Win32 API calls. I.e.: there exists no such thing as CreateDirectory on Windows: you either are linked to CreateDirectoryA or CreatedirectoryW, thanks to a zillion mapping #defines in the Windows system header files. If you don' say nothin', you get grade 'A' American legacy mode. 😉

Meanwhile, everybody is free to call those 'A' and 'W' variants from anywhere they like, in any mix, within the same compiled code base.

Which is then sold as 'Unicode' vs. 'Ansi/legacy/MBCS', which, AFAICT, is still partly true (though large parts of Win32/64 APIs now do support UTF8), but the bottom line key argument here to explicitly invoke the 'A' variants always is so you are guaranteed char-based string type compatibility, no matter what you do to your compiler settings. No side effects (except when folks start feeding you a ton of Unicode peculiars, but I consider those out of scope anyway) and no compiler errors for those oddballs that have set /Unicode in the IDE or build command line.

Anyway. That's the unvarnished story.

It's up to you, it is not intended to have leptonica support non-ascii named image filenames/paths, but I do understand you can see the 'camel' from where you are looking at this. So, ultimately, you're the boss (besides, I am just passing by and lingering at your oasis on my way to other horizons) and this is what a local git fork is for anyway: so I can apply horrid tweaks that are of local interest only.

I rest, your honor. 😉


DanBloomberg commented 1 year ago

Wow. I am, as the Brits say, gobsmacked by your detailed explanation!

So, I was clearly wrong. The camel has come nowhere near the tent. The only issue with going ahead is one you brought up initially -- whether or not those ascii-oriented functions are accessible from cmake when building on Windows.

@zdenop @stweil Is it possible for one of you patch Ger's PR and make sure this works with a standard Windows build?

GerHobbelt commented 1 year ago

Is it possible for one of you patch Ger's PR and make sure this works with a standard Windows build?

&

Meanwhile, everybody is free to call those 'A' and 'W' variants from anywhere they like, in any mix, within the same compiled code base.

together mean: if this PR doesn't compile/build without errors in the windows CI build(s) you already have, then it means I effed up. Using CreateDirectoryA instead of Createdirectory (etc.etc.) should mean nothing changes for the existing build environments that already work, while it will help folks like me (who tweak the build envs to use cl /Unicode for whatever reason) by not breaking then either.

If you would try a cl /Unicode MSVC build without this patch, then you'll be pelted with compiler errors -- due to the silent mapping of CreateDirectory to CreatedirectoryW in this scenario.

🤔 Or did you mean you wanted the current CI builds extended by including a /Unicode build there alongside the current ones? (I wonder if I misunderstood your question there...)

DanBloomberg commented 1 year ago

I misunderstood you, and thought you said some MSVC builds might fail with your patch. So these substitutions all look fine. Thank you for your patience!