The only thing this defends against is LD_PRELOAD loading the wrong soname. Please don't do this as it's extremely inconvenient for Linux distributions and does not solve any known problem.
@DennisHeimbigner comments in #864 that the HDF5 version checking has caught header/library mismatches for him several times. I agree; that has also happened to me.
As alluded to by @gsjaardema, HPC systems often do not give users much visibility into the LD_LIBRARY_PATH. Users just load the netcdf module and the system sets something in LD_LIBRARY_PATH. Meanwhile, they use whatever netcdf.h they find, and it may or may not be correct. That is the reason for the check in HDF5.
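For readers who have not seen it, the kind of check being discussed works roughly like the sketch below. This is a generic illustration under made-up `mylib_*` names, not the actual HDF5 or netcdf source: the installed header bakes its own version numbers into every client object file, and the library aborts at run time if they disagree with the version it was built with.

```c
/* mylib.h -- generic sketch only; all names are hypothetical */
#define MYLIB_VERS_MAJOR 1
#define MYLIB_VERS_MINOR 10
#define MYLIB_VERS_PATCH 4

int  mylib_open_impl(const char *path);
void mylib_check_version(int major, int minor, int patch);

/* every public entry point is wrapped so the header's version numbers
 * travel into the library along with the call */
#define mylib_open(path) \
    (mylib_check_version(MYLIB_VERS_MAJOR, MYLIB_VERS_MINOR, MYLIB_VERS_PATCH), \
     mylib_open_impl(path))

/* mylib.c -- compiled against the library's own copy of the header */
#include <stdio.h>
#include <stdlib.h>

void mylib_check_version(int major, int minor, int patch)
{
    if (major != MYLIB_VERS_MAJOR || minor != MYLIB_VERS_MINOR ||
        patch != MYLIB_VERS_PATCH) {
        fprintf(stderr, "headers are %d.%d.%d but library is %d.%d.%d; recompile\n",
                major, minor, patch,
                MYLIB_VERS_MAJOR, MYLIB_VERS_MINOR, MYLIB_VERS_PATCH);
        abort();   /* the hard abort on mismatch is the behavior debated below */
    }
}
```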
So @eli-schwartz can I ask you to elaborate on why this makes it extremely inconvenient for Linux distributions?
For completeness, let me propose this: the idea is to provide e.g. nc_open32 and nc_open64 (or whatever names) as well as the original nc_open, and then add a #define to the new netcdf header that maps nc_open onto nc_open64.
This means that any code recompiled with the new header containing the #define will translate its nc_open calls into calls to nc_open64, whereas any old code that uses the new library will call a version of nc_open that then calls nc_open32, as expected. Anyone see a flaw in this (aside from it being ugly)?
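A minimal sketch of what that might look like, purely illustrative; nc_open32/nc_open64 and the widened mode type are placeholder names from the comment above, not actual netcdf-c API:

```c
/* ---- new netcdf.h (sketch) ---- */
int nc_open32(const char *path, int mode, int *ncidp);
int nc_open64(const char *path, long long mode, int *ncidp);

/* anything recompiled against this header transparently calls nc_open64 */
#define nc_open(path, mode, ncidp) nc_open64((path), (mode), (ncidp))

/* ---- library side (sketch; compiled with the macro removed) ---- */
#undef nc_open
/* the old symbol stays in the .so so existing binaries keep working;
 * it simply forwards to the 32-bit entry point */
int nc_open(const char *path, int mode, int *ncidp)
{
    return nc_open32(path, mode, ncidp);
}
```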
Soname versioning works for every other piece of software known to mankind in all of computing history...
This is inconvenient for Linux distributions for the same reason it is inconvenient for regular users. It means that rather than trusting the declared compatibility versioning, you hard abort based on a completely different, unrelated versioning scheme, even when the library isn't actually incompatible.
The major.minor.patchlevel version of hdf5 changes on a superset of the occasions when ABI compatibility is actually broken: it covers every soname change, plus many more changes besides.
However, because of the hard aborts driven by this misplaced runtime check, the soname simply doesn't matter in the slightest: matching sonames that should work fine nevertheless require completely recompiling the entire hdf5 stack without exception. Recompiling the world on every patchlevel release is pretty annoying; it requires involved, domain-specific knowledge to know when you cannot trust the soname versions, and it is difficult to check which software was compiled against a given patchlevel release, whereas there are trivial tools to check which soname something was compiled against.
Then you end up with situations where some Linux distributions decide they don't want to play monkey games, and they patch out the check like this: https://salsa.debian.org/debian-gis-team/hdf5/blob/master/debian/patches/relax-version-check.patch (And they get the check wrong, but it doesn't matter because the only check that matters is the soname.)
If you don't trust sonames, why do you use them in the first place? If you do trust them but are worried about compile-time header mismatches, how does that translate to runtime checks, and either way, why not do compile-time checks against the thing which matters, that is, the ABI version also used by the soname?
I am not sure I understand.
> why not do compile-time checks against the thing which matters, that is, the ABI version also used by the soname?

I think the check has to be at runtime because I am assuming that it is possible to run a client against more than one .so for a given library used by the client. Is this last assumption false?
> I think the check has to be at runtime because I am assuming that it is possible to run a client against more than one .so for a given library used by the client. Is this last assumption false?
Yes it is possible, but that is exactly what soname versioning is for: to forbid running against an incompatible library. If a piece of software is compiled against libnetcdf.so.5, it will not be able to run against libnetcdf.so.6 (unless you force it to using LD_PRELOAD). But libnetcdf.so.5 can be provided by, for instance, every release from 5.2.0 to 5.4.3, and the .6 one by everything from 5.5.0 until 6.0.0. The important point is to change it when the library is actually incompatible, not when you release a new patch-level release fixing 3 or even 30 bugs. ;)
Comment for Ward: what has been our previous experience with bumping the libnetcdf.so major version?
Several thoughts:
If you have multiple versions of libhdf5.so being loaded by a single program, then you'll have multiple copies of the symbols, and the dynamic linker will hide some extra copies. IIRC it is an operating system implementation detail whether the first loaded or second loaded library wins.
Shortly after, your program will segfault or exhibit other signs of undefined behavior as it attempts to use symbols which it thinks are correct, in ways they were not meant to be used. This will happen regardless of your runtime checks, and in all probability, before you ever try running code containing runtime checks -- but we're dealing with undefined behavior here, so who knows?
This is not a sane environment to be working in. It's not an allowed environment. It happens when your application is correctly linked against two dependent libraries, and each is correctly linked against libhdf5, but wrongly linked against different versions of libhdf5. Each dependent library is therefore part of an entirely different, exclusive software stack. If one is in your LD_LIBRARY_PATH, the other should not be, and this applies regardless of where you separately store your headers. Is this something that is a serious concern to hdf5 or netcdf users? Is this something which is even allowed by the working environment of hdf5 or netcdf users? How do you end up linking your application to multiple different hdf5-using dependencies acquired from multiple incompatible locations that all still exist?
I would not want to rely on a runtime check to catch this.
If that is the concern, then the hdf5 version checking code being held up as an example is wrong. It emits a fatal error claiming that the headers don't match the library, and you should recompile the application... but as far as I can tell, depending on which library was loaded first, the code which loads the data will work fine, but the code which processes it will silently load incompatible symbols and do who-knows-what to your data without the warnings being effective at all.
...
This goes far beyond mismatched headers.
As mentioned in https://github.com/Unidata/netcdf-c/issues/864#issuecomment-441465031:
The use of symbol versioning would allow you to build part of your application against one version of hdf5 or netcdf, and part of your application against another version, and your application would still work.
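To make that mechanism concrete, here is a minimal sketch of GNU symbol versioning; the version node names (NETCDF_4_6, NETCDF_5_0) and the _v1/_v2 helpers are invented for illustration, and a real build would also pass a linker version script declaring those nodes:

```c
/* Illustrative only -- not actual netcdf-c code.
 * Build the shared library with something like:
 *   gcc -shared -fPIC nc_open_versions.c -Wl,--version-script=netcdf.map
 * where netcdf.map declares the NETCDF_4_6 and NETCDF_5_0 version nodes. */

int nc_open_v1(const char *path, int mode, int *ncidp);
int nc_open_v2(const char *path, long long mode, int *ncidp);

/* binaries built against the old release keep binding to this version... */
__asm__(".symver nc_open_v1,nc_open@NETCDF_4_6");
/* ...while anything linked from now on gets the new ABI by default (@@) */
__asm__(".symver nc_open_v2,nc_open@@NETCDF_5_0");

int nc_open_v1(const char *path, int mode, int *ncidp)
{
    /* widen the old mode flag and forward to the new implementation */
    return nc_open_v2(path, (long long)mode, ncidp);
}

int nc_open_v2(const char *path, long long mode, int *ncidp)
{
    (void)path; (void)mode; (void)ncidp;
    /* ... real implementation ... */
    return 0;
}
```

With both versions exported from one libnetcdf.so, an application whose dependencies were built against different netcdf releases still resolves each nc_open reference to the ABI it was compiled for, which is the property described above.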
@dennisheimbigner I’m still on Thanksgiving travels so will weigh in more fully later. The soversioning is completely separate from the library versioning, and there are rigid rules for how major.minor.revision are incremented. I’ll post the link when I have the chance.
Also, we need to have a solution that works on Windows, OSX, and Cygwin.
I will catch up on this when I return to the office, but the argument "against" seems fairly straightforward. At a glance I’m not sure if this proposed fix addresses a hypothetical or demonstrated issue, but I will review and weigh in (the issue is probably explained in #864). SO versioning is robust and anything that breaks or abandons it at this point would have to be very well justified.
Enjoy your weekend and I’ll review tomorrow.
> Also, we need to have a solution that works on Windows, OSX, and Cygwin.
Cygwin works exactly the same way Linux works.
OSX uses libfoo.version.dylib whereas Linux uses libfoo.so.version, but aside from the order in which these appear, the same rules apply.
Windows uses private dependencies of everything, and manifests which describe exactly where that dependency is located, so chances of mysteriously loading multiple copies of the library via other libraries pinned the same way but depending on other versions, and having the symbols overwrite each other are... low?
@eli-schwartz OK, I get your point. Checking versions in this way breaks one of the key features of shared libraries. We don't want to do that.
Looks like eli gave all the arguments I would. Use the proper tool for the job. The hdf5 version check has been a pain in my side for the last 13 years that I've maintained the hdf5 package in Fedora. The thought of netcdf taking it up as well is just too much to bear.
So why did the HDF5 group do it? They must have had a specific reason for using the header version as opposed to the .so version? Has anyone ever seen an explanation from them?
Maybe back in the 90s it seemed like a good idea. If you are not doing proper API/ABI management it may seem like a reasonable thing to do. But with the availability of modern tools it is completely the wrong thing to do.
OK, seems like the consensus is that this is not a good idea, because it makes shared libraries much less effective.
I will close this ticket.
Answer from hdfgroup:

Hi Dennis,

On Mon, 26 Nov 2018, Dennis Heimbigner wrote:

> The netcdf group at Unidata was looking at the header version checking that is done by the HDF5 library to ensure compatibility between the header used by client code and the API provided by the HDF5 library.
>
> The question we have is: why did you do this rather than rely on the shared library major version number mechanism? Is the rationale behind the header version checking documented anywhere?
>
> =Dennis Heimbigner, Unidata

--
OK, well I guess that drives a stake through the heart of this idea. ;-)
I am breaking this discussion out of #864 so that it can be evaluated on its own merits (or lack of merit).
It has been proposed that we add a check at run-time to ensure that the version of the netcdf.h file matches the version of the library being used.
This checking is also done by HDF5.
The version check is proposed to be accomplished by making nc_open/nc_create into macros, which will call the real nc_open/nc_create with an added version parameter. This is explained well by @gsjaardema in #864.
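Roughly, the proposal amounts to something like the following sketch. The names here (NC_DISPATCH_VERSION, nc_open_ver) are invented for illustration and are not existing netcdf-c identifiers:

```c
/* ---- netcdf.h (sketch) ---- */
#define NC_DISPATCH_VERSION 5   /* bumped whenever the library's ABI changes */

int nc_open_ver(const char *path, int mode, int *ncidp, int header_version);

/* client code keeps writing nc_open(...); the macro passes along the
 * version of the header it was compiled against */
#define nc_open(path, mode, ncidp) \
    nc_open_ver((path), (mode), (ncidp), NC_DISPATCH_VERSION)

/* ---- library side (sketch) ---- */
int nc_open_ver(const char *path, int mode, int *ncidp, int header_version)
{
    if (header_version != NC_DISPATCH_VERSION)
        return -1;               /* report a hypothetical "header mismatch" error */
    (void)path; (void)mode; (void)ncidp;
    /* ... proceed with the real open ... */
    return 0;                    /* NC_NOERR */
}
```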
@opoplawski objects to this feature. I invite him to list his reasons in this issue.
@gsjaardema uses a similar feature in his own exodus package. He thinks it saves hours of debugging when researchers get confused about which header file to use and what is in their LD_LIBRARY_PATH. Given the way that most supercomputers try to abstract LD_LIBRARY_PATH away from users (instead of just explaining it), I can understand how this kind of confusion would arise.
It has been proposed that this feature is a necessary prerequisite for changing the mode flag size; however, that is less clear. Changing the API in any way is correctly handled by shared library builds, as long as we upgrade the so number correctly. However, if we are to have such a version check, I am happy to implement it before attempting to change the size of the mode field.