DanBloomberg / leptonica

Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications. The official github repository for Leptonica is: danbloomberg/leptonica. See leptonica.org for more documentation.
Other
1.76k stars 387 forks source link

Use atomic operations for reference counters #614

Closed stweil closed 2 years ago

stweil commented 2 years ago

Signed-off-by: Stefan Weil sw@weilnetz.de

stweil commented 2 years ago

~The Microsoft compiler seems to require a different solution.~ That's fixed now by conditional compilation. Microsoft Visual Studio only supports atomic operations for C++, so the new C code is still not thread-safe with that compiler.

stweil commented 2 years ago

This pull request includes these changes:

DanBloomberg commented 2 years ago

A few comments/questions:

I believe we've consistently put system includes like string.h ahead of allheaders.h.

Are you proposing to replace

include "allheaders.h"

by in general several "internal" files, such as

include " x_internal" (x = pix, array, bytea, ccbord, ...)

in all 510 .c files in src and prog ? If so, that doesn't make semantic sense: allheaders.h has nearly all the header information required for compiling: it includes nearly all the header files as well as all the prototypes.

How does calling it "pix_internal.h" make it different from calling it "pix.h"? This is C code -- there are no private fields.

Why are we making both the atomic ref counting and the reorganization with "internal" data at the same time, instead of separately? These are both significant changes throughout the code.

DanBloomberg commented 2 years ago

More questions:

Is stdatomic a C++ library? If so, what issues will arise from linking it? Such as:

stweil commented 2 years ago

stdatomic.h is C11, so it is normal C code. Adding the related include statement and using atomic_int for all refcount member variables does not require changes in Makefile.am, makefile.static or any CMakeLists.txt. Because of the broken Microsoft compilers which still don't fully support C11 some conditional compilation is needed.

Each time when Leptonica's data types are changed, that can currently break compatibility to older versions. That's caused by the fact that the current API exposes all implementation details because the API header file allheaders.h includes files like pix.h which includes the full definitions of each struct. Technically that's not necessary. Those definitions can be done in separate internal header files like pix_internal.h which are not part of the public API, so end users don't see them. Such internal header files can be changed without breaking compatibility. Of course the Leptonica code must use the internal header files where the internal information is needed. Ideally the public API should hide as many implementation details as possible.

Fixing atomic access for the refcount variables can be implemented without introducing internal header files and breaking code because the new data type atomic_int is also 32 bit, but then requires stdatomic.h to be included in all compilations, not only when building Leptonica, but also for all user code.

DanBloomberg commented 2 years ago

It's certainly good that the interface is C, so that there shouldn't be any function name mangling. I was wondering about the stdatomic library. If it is C++, I have read that it changes the way you have to compile and link executables.

for example, I found this statement about linking C with C++ libraries. Don't know if it's relevant: "If your main is C, then you are probably OK except for static variables. Any constructors with your static variables are supposed to be called before main() start. This won't happen if C is your main. "

But perhaps this is not an issue.

stweil commented 2 years ago

There is no stdatomic library. Much of the stdatomic implementation is realized by macros and does not require a library at all, and the rest is part of the standard C library (which is already used by default).

DanBloomberg commented 2 years ago

nice!

stweil commented 2 years ago

Regarding your other comments / questions:

I believe we've consistently put system includes like string.h ahead of allheaders.h.

That's also my preferred way, but Tesseract meanwhile uses a different order, so maybe I was a little bit confused.

Are you proposing to replace #include "allheaders.h" by in general several "internal" files [...]

No. I mixed two changes here, namely atomic operations and hiding implementation details in private header files because adding atomic operations again required changes in the public API. We already had several discussions in the past that ideally public API files should rarely require any change, and I just wanted to show that it is possible to separate public and private header files.

If so, that doesn't make semantic sense: allheaders.h has nearly all the header information required for compiling: it includes nearly all the header files as well as all the prototypes.

That's exactly the problem: allheaders.h is Leptonica's public API and should not expose implementation details like the member variables of struct Pix. Only code which implements that functionality should get that details (and therefore requires the private header file). User code (and most code in prog) does not need it.

How does calling it "pix_internal.h" make it different from calling it "pix.h"? This is C code -- there are no private fields.

Right. "Private" for C code simply means that such code is private by convention and not part of the public API. pix.h would be private if it were removed from allheaders.h and from the list of installed header files. That would be an even more radical solution than moving some definitions from pix.h to the new pix_internal.h.

Why are we making both the atomic ref counting and the reorganization with "internal" data at the same time, instead of separately?

We don't have to do that. It was just a test for me, because I did not like the idea of adding stdatomic.h to the public API. But the reorganization should be addressed for the next major version of Leptonica (whenever it is planned).

DanBloomberg commented 2 years ago

Funny thing. I went to sleep last night thinking about this, and when I woke up I understood it.

I also understood work-arounds for my biggest concern, which is the large amount of leptonica code at Google, much of it not written by me and not using accessors. It is important that people don't need to modify existing code with updates of the library, especially since I'm no longer there to help out. The simplest work-around is to add the new internal files to alltypes.h.

I also was concerned about the modification to ref counting in the same context. If someone were to have directly modified the refcount field, that work-around would save them. I checked the prog directory, and there are no calls to SetRefcount and ChangeRefcount. I'll ask someone at Google to check for any use of them -- if none, I think we can toss the functions.

I'm completely on-board with these two changes. Making ref counts atomic is a clear improvement, achieved without adding complexity. Pulling the internals of the major data structures out of the "public" interface allheaders.h is a good idea, as long as they remain available if needed for existing applications. I still think it is best to push them sequentially.

DanBloomberg commented 2 years ago

I'd like to implement your internal vs external data approach in a slightly different way.

Pull out the internal struct definitions and put them (only) in the *_internal.h files. Then continue using allheaders.h explicitly for all source files in the library, and include the pix_internal.h (etc) files after it where needed. So pix_internal.h (etc) would not include allheaders.h or stdatomic.h.

Does that make sense? If so, I'll go ahead and give it a try.

stweil commented 2 years ago

Maybe a simpler approach would just remove most include statements from alltypes.h. It only needs environment.h, typedef statements for some struct and union data types and anything else which should be visible to all API users. Then header files like pix.h would become internal header files.

stdatomic.h is not needed in the public API, but only used internally.

DanBloomberg commented 2 years ago

I'd like the generic algorithm containers to be public, as well as application data such as in dewarp and recog.

array and pix are good candidates for making internal files. bilateral is a possibility. ccbord as well. Perhaps eventually readbarcode, which should be entirely rewritten with a proper struct container.

stweil commented 2 years ago

The changes in this pull request were now handled by follow-up pull requests, so it can be closed.