Closed stweil closed 2 years ago
~The Microsoft compiler seems to require a different solution.~ That's fixed now by conditional compilation. Microsoft Visual Studio only supports atomic operations for C++, so the new C code is still not thread-safe with that compiler.
This pull request includes these changes:
l_int32 refcount
and l_uint32 refcount
by atomic_int refcount
. This fixes an issue with multithreaded programs (and most errors reported by Coverity Scan). atomic_int
requires a compiler with at least C11 support. This standard was introduced 10 years ago, but is still unsupported by Microsoft Visual Studio. Therefore a fallback to int
is used for such compilers.xxxGetRefcount
and xxxChangeRefcount
and replace them by direct operations on the refcount
member variables. Maybe those functions should be marked as deprecated and removed in future major versions of Leptonica.refcount
from the public API. Use three new header files which are used only internally for those data types. If desired, the new header files could be split in smaller units or be combined in a single internal header file, and other declarations could be moved from the public API to such internal header files as well.A few comments/questions:
I believe we've consistently put system includes like string.h ahead of allheaders.h.
Are you proposing to replace
by in general several "internal" files, such as
in all 510 .c files in src and prog ? If so, that doesn't make semantic sense: allheaders.h has nearly all the header information required for compiling: it includes nearly all the header files as well as all the prototypes.
How does calling it "pix_internal.h" make it different from calling it "pix.h"? This is C code -- there are no private fields.
Why are we making both the atomic ref counting and the reorganization with "internal" data at the same time, instead of separately? These are both significant changes throughout the code.
More questions:
Is stdatomic a C++ library? If so, what issues will arise from linking it? Such as:
stdatomic.h
is C11, so it is normal C code. Adding the related include statement and using atomic_int
for all refcount
member variables does not require changes in Makefile.am
, makefile.static
or any CMakeLists.txt
. Because of the broken Microsoft compilers which still don't fully support C11 some conditional compilation is needed.
Each time when Leptonica's data types are changed, that can currently break compatibility to older versions. That's caused by the fact that the current API exposes all implementation details because the API header file allheaders.h
includes files like pix.h
which includes the full definitions of each struct
. Technically that's not necessary. Those definitions can be done in separate internal header files like pix_internal.h
which are not part of the public API, so end users don't see them. Such internal header files can be changed without breaking compatibility. Of course the Leptonica code must use the internal header files where the internal information is needed. Ideally the public API should hide as many implementation details as possible.
Fixing atomic access for the refcount
variables can be implemented without introducing internal header files and breaking code because the new data type atomic_int
is also 32 bit, but then requires stdatomic.h
to be included in all compilations, not only when building Leptonica, but also for all user code.
It's certainly good that the interface is C, so that there shouldn't be any function name mangling. I was wondering about the stdatomic library. If it is C++, I have read that it changes the way you have to compile and link executables.
for example, I found this statement about linking C with C++ libraries. Don't know if it's relevant: "If your main is C, then you are probably OK except for static variables. Any constructors with your static variables are supposed to be called before main() start. This won't happen if C is your main. "
But perhaps this is not an issue.
There is no stdatomic library. Much of the stdatomic implementation is realized by macros and does not require a library at all, and the rest is part of the standard C library (which is already used by default).
nice!
Regarding your other comments / questions:
I believe we've consistently put system includes like string.h ahead of allheaders.h.
That's also my preferred way, but Tesseract meanwhile uses a different order, so maybe I was a little bit confused.
Are you proposing to replace #include "allheaders.h" by in general several "internal" files [...]
No. I mixed two changes here, namely atomic operations and hiding implementation details in private header files because adding atomic operations again required changes in the public API. We already had several discussions in the past that ideally public API files should rarely require any change, and I just wanted to show that it is possible to separate public and private header files.
If so, that doesn't make semantic sense: allheaders.h has nearly all the header information required for compiling: it includes nearly all the header files as well as all the prototypes.
That's exactly the problem: allheaders.h
is Leptonica's public API and should not expose implementation details like the member variables of struct Pix
. Only code which implements that functionality should get that details (and therefore requires the private header file). User code (and most code in prog
) does not need it.
How does calling it "pix_internal.h" make it different from calling it "pix.h"? This is C code -- there are no private fields.
Right. "Private" for C code simply means that such code is private by convention and not part of the public API. pix.h
would be private if it were removed from allheaders.h
and from the list of installed header files. That would be an even more radical solution than moving some definitions from pix.h
to the new pix_internal.h
.
Why are we making both the atomic ref counting and the reorganization with "internal" data at the same time, instead of separately?
We don't have to do that. It was just a test for me, because I did not like the idea of adding stdatomic.h
to the public API. But the reorganization should be addressed for the next major version of Leptonica (whenever it is planned).
Funny thing. I went to sleep last night thinking about this, and when I woke up I understood it.
I also understood work-arounds for my biggest concern, which is the large amount of leptonica code at Google, much of it not written by me and not using accessors. It is important that people don't need to modify existing code with updates of the library, especially since I'm no longer there to help out. The simplest work-around is to add the new internal files to alltypes.h.
I also was concerned about the modification to ref counting in the same context. If someone were to have directly modified the refcount field, that work-around would save them. I checked the prog directory, and there are no calls to SetRefcount and ChangeRefcount. I'll ask someone at Google to check for any use of them -- if none, I think we can toss the functions.
I'm completely on-board with these two changes. Making ref counts atomic is a clear improvement, achieved without adding complexity. Pulling the internals of the major data structures out of the "public" interface allheaders.h is a good idea, as long as they remain available if needed for existing applications. I still think it is best to push them sequentially.
I'd like to implement your internal vs external data approach in a slightly different way.
Pull out the internal struct definitions and put them (only) in the *_internal.h files. Then continue using allheaders.h explicitly for all source files in the library, and include the pix_internal.h (etc) files after it where needed. So pix_internal.h (etc) would not include allheaders.h or stdatomic.h.
Does that make sense? If so, I'll go ahead and give it a try.
Maybe a simpler approach would just remove most include statements from alltypes.h
. It only needs environment.h
, typedef
statements for some struct
and union
data types and anything else which should be visible to all API users. Then header files like pix.h
would become internal header files.
stdatomic.h
is not needed in the public API, but only used internally.
I'd like the generic algorithm containers to be public, as well as application data such as in dewarp and recog.
array and pix are good candidates for making internal files. bilateral is a possibility. ccbord as well. Perhaps eventually readbarcode, which should be entirely rewritten with a proper struct container.
The changes in this pull request were now handled by follow-up pull requests, so it can be closed.
Signed-off-by: Stefan Weil sw@weilnetz.de