djhoese opened this issue 1 year ago
I'm still trying to make a smaller reproducible example beyond running all of Satpy's tests. However, Satpy's tests are very complicated as far as dependencies go. They include GDAL, netcdf4-python, xarray, dask, cython extensions, etc. So there could be a lot of things conflicting with libnetcdf beyond the changes to the builds.
Ah @simonrp84 just noted that these errors show up when his pytest command hangs:
Exception ignored in: <function File.close at 0x000002B262A5CFE0>
Traceback (most recent call last):
  File "C:\Users\Simon\miniconda3\Lib\site-packages\h5netcdf\core.py", line 1200, in close
  File "C:\Users\Simon\miniconda3\Lib\site-packages\h5py\_hl\files.py", line 578, in close
TypeError: 'NoneType' object does not support the context manager protocol
(the same traceback is repeated several more times)
Ok I've got a reproducible example without Satpy:
import xarray as xr

def test_netcdf():
    fk = "test.nc"
    ds = xr.Dataset(
        coords={"nx": [0], "ny": [0]},
        attrs={
            "source": "satpy unit test",
            "time_coverage_start": "0001-01-01T00:00:00Z",
            "time_coverage_end": "0001-01-01T01:00:00Z",
        },
    )
    ds.to_netcdf(fk)
Save it as test_netcdf.py, run it with python -c "import test_netcdf; test_netcdf.test_netcdf()", and it will just hang when run on Windows.
This also affects GDAL CI on Windows, where tests involving netCDF hang: https://github.com/OSGeo/gdal/actions/runs/5589991097/jobs/10222968884 e.g. the following hangs after reaching 100% completion:
gdal_translate autotest\gcore\data\byte.tif out.nc
Input file size is 20, 20
0...10...20...30...40...50...60...70...80...90...100 - done.
And I also traced it to the nompi_h5902ca5_107 --> nompi_h624ddae_109 upgrade
@rouault I haven't tested it yet, but I'm wondering if the 108 build is the actual issue. The 108 build should be the result of #180 being merged.
A quick test on my CI shows 108 as the problem. Might have to revert the S3 support. CC @zklaus @dennisheimbigner @dopplershift
Just to confirm, you are saying 107 works as expected?
@zklaus As far as I can tell, yes, 107 works fine. 107 was released almost 2 weeks earlier and I never noticed any issues. Based on @rouault's gdal_translate example, my guess is that opening a NetCDF file for writing is what causes the hang.
It is a bit strange because the test you propose doesn't seem to have anything to do with S3, so I am wondering if it's a random upgrade in the environment that is causing this. To check, I'll add it as a test to the recipe and see if de-/activating it makes a difference. If the trace given above is any indication, h5netcdf or h5py might be involved.
I also thought h5netcdf or h5py could be the issue, but now that I've narrowed it down to my simple Python example that seems less likely. I can try a version that only imports the lower-level netcdf4-python library. If @rouault knows the exact part of GDAL that does the NetCDF writing, or if there is a test function that writes a NetCDF file, maybe that would also trigger it and remove Python from the equation entirely.
The exception I posted earlier is very strange, but it may be a conflict between h5py and netcdf4-python talking to HDF5/netCDF-C internals. As far as the runtime environment goes, it is exactly the same except for that libnetcdf version. The build environment might be a different story. I suppose the logs for those are long gone though.
> so I am wondering if it's a random upgrade in the environment that is causing this.
From my manual testing, it is clearly the nompi_h5902ca5_107 --> nompi_h624ddae_108 upgrade. Manual installation of 107 results in OK behaviour. Manual installation of 108 results in hangs, both when reading and writing with GDAL utilities. Interestingly, the ncdump and ncgen utilities don't seem to be affected.
@zklaus Got my python script down to this:
def test_netcdf4_python():
    from netCDF4 import Dataset
    nc = Dataset("test_netcdf4_python.nc", mode="w")
    nc.close()
I threw in the .close() because in my own tests I expected a properly closed file not to hang, but after 3 minutes of running this script in my CI I think it is safe to say it is hanging. I'll try removing some dependencies from the environment to triple-check that it isn't some weird conflict with some other library. I'm hoping the above netcdf4-python code translates easily to C/C++.
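Something like this is roughly what I have in mind for a C/C++ translation (an untested sketch on my part; the create flags and error handling are guesses, and I haven't tried building it on Windows yet):

/* Untested sketch of a C/C++ translation of the netcdf4-python reproducer
 * above. Assumes the netCDF-C API (netcdf.h); the create flags mirror what
 * netCDF4.Dataset(..., mode="w") does by default (NETCDF4/HDF5 format). */
#include <stdio.h>
#include <netcdf.h>

int main(void) {
    int ncid;
    int status;

    /* Equivalent of Dataset("test_netcdf4_python.nc", mode="w") */
    status = nc_create("test_netcdf4_python.nc", NC_CLOBBER | NC_NETCDF4, &ncid);
    if (status != NC_NOERR) {
        fprintf(stderr, "nc_create failed: %s\n", nc_strerror(status));
        return 1;
    }

    /* Equivalent of nc.close() */
    status = nc_close(ncid);
    if (status != NC_NOERR) {
        fprintf(stderr, "nc_close failed: %s\n", nc_strerror(status));
        return 1;
    }

    printf("Finished\n");
    /* If this matches the Python behaviour, the stall would happen after
       this point, during library teardown at process exit. */
    return 0;
}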
Thanks, @djhoese, that's a nice and compact test! I have plugged your earlier test into the recipe in #183, but curiously, the tests seem not to be run on Windows?! Does anyone know what's up with that? I note that there is a run_test.sh script but no run_test.bat; the tests are in the meta.yaml anyways.
@zklaus should we mark the Windows builds >=107 as broken until we figure this out? I can do that if you all agree.
> should we mark the Windows builds >=107 as broken

should be >=108; 107 is fine
Heads up @WardF
Is there any chance someone can do the following:
Warning, the trace output may be voluminous.
If someone can tell me how to compile my little C program above with conda-forge's preferred compilers from just the command line (something I can put into CI), we may be able to reduce the amount of output by a lot, assuming the test C program triggers the same hanging issue.
> Is there any chance someone can do the following:
I tried to do this now in #185. Let me know what you think!
Just to let everyone know, I will be on vacation from tomorrow until 23 August. Feel free to take over my branches and to play around.
I'm afraid I don't have the skills needed to make headway on this. But it seems to be jamming up our ability to migrate to new hdf5, so it would be great if we can make some progress on it. Since @zklaus is still on vacation, is there anyone else with the necessary skills available?
I'm taking a look at this so that we can hopefully figure out why simply enabling the ncZarr S3 support would cause any hangs, particularly on Windows. Thanks all for the diagnostic work put in and, more importantly, your patience. I'll try to convert the python script above to a C program I can work with in Windows.
As somebody who doesn't use Python for their day-to-day work 😅, is there any chance I might be able to get an environment.yml file for the environment you use to replicate this issue, @djhoese? This will bootstrap my efforts to replicate this in Python before I start trying to replicate it in C. Thanks :).
I think you could replicate with:
conda create -n bad python=3.11 netcdf4
conda install -n bad -c conda-forge/label/broken libnetcdf=4.9.2=nompi_h624ddae_109
@WardF Note the smaller example code in this comment: https://github.com/conda-forge/libnetcdf-feedstock/issues/182#issuecomment-1643931549. This should mean your environment can be created with only netcdf4 as the requested package; that should pull in libnetcdf (this C library) and python itself. Actually, I'll just paste the script here too:
def test_netcdf4_python():
    from netCDF4 import Dataset
    nc = Dataset("test_netcdf4_python.nc", mode="w")
    nc.close()

test_netcdf4_python()
Edit: Added the actual call of the function.
> Is there any chance someone can do the following:
> I tried to do this now in #185. Let me know what you think!
This appears to not produce anything, so it is not what I need.
> @WardF Note the smaller example code in this comment: [...] Actually I'll just paste the script here too: [...]
Thanks!
@DennisHeimbigner The log rotated off since it's been a month. I tried to get CI to re-run but failed. Might be best to see if @WardF can reproduce locally.
We (Ward and I) really need to see the configure options used to create libnetcdf.
I made a dummy PR: https://github.com/conda-forge/libnetcdf-feedstock/pull/187 Hopefully the CI there will tell you what you need to know?
@DennisHeimbigner In addition to the job that @xylar posted, the build script for that job was:
mkdir %SRC_DIR%\build
cd %SRC_DIR%\build
set BUILD_TYPE=Release
:: set BUILD_TYPE=RelWithDebInfo
:: set BUILD_TYPE=Debug
rem to be filled with mpi options
set PARALLEL=""
rem manually specify hdf5 paths to work-around https://github.com/Unidata/netcdf-c/issues/1444
cmake -LAH -G "NMake Makefiles" ^
%CMAKE_ARGS% ^
-DCMAKE_INSTALL_PREFIX="%LIBRARY_PREFIX%" ^
-DCMAKE_PREFIX_PATH="%LIBRARY_PREFIX%" ^
-DCMAKE_BUILD_TYPE=%BUILD_TYPE% ^
-DBUILD_SHARED_LIBS=ON ^
-DBUILD_UTILITIES=ON ^
-DENABLE_DOXYGEN=OFF ^
-DENABLE_TESTS=ON ^
-DENABLE_EXTERNAL_SERVER_TESTS=OFF ^
-DENABLE_DAP=ON ^
-DENABLE_DAP_REMOTE_TESTS=OFF ^
-DENABLE_HDF4=ON ^
-DENABLE_NETCDF_4=ON ^
-DENABLE_PLUGIN_INSTALL=ON ^
-DPLUGIN_INSTALL_DIR=YES ^
-DENABLE_CDF5=ON ^
-DENABLE_BYTERANGE=ON ^
-DENABLE_NCZARR=on ^
-DENABLE_NCZARR_ZIP=on ^
-DENABLE_NCZARR_S3=on ^
-DENABLE_NCZARR_S3_TESTS=off ^
-DENABLE_S3_SDK=on ^
-DHDF5_C_LIBRARY="%LIBRARY_LIB:\=/%/hdf5.lib" ^
-DHDF5_HL_LIBRARY="%LIBRARY_LIB:\=/%/hdf5_hl.lib" ^
-DHDF5_INCLUDE_DIR="%LIBRARY_INC:\=/%" ^
-DCMAKE_C_FLAGS="-DH5_BUILT_AS_DYNAMIC_LIB" ^
%PARALLEL% ^
%SRC_DIR%
if errorlevel 1 exit /b 1
cmake --build . --config %BUILD_TYPE% --target install
if errorlevel 1 exit /b 1
:: We need to add some entries to PATH befo
I'll note that the only "relevant" (to my eye) changes in that PR were:
-DENABLE_NCZARR_S3=on \
-DENABLE_S3_SDK=on \
I'm able to replicate this in Python (thanks for the bootstrap, all), but the hang happens after nc.close() returns. At least, I infer that from adding a print("Finished") call after nc.close().
The following also hangs:
def test_netcdf4_python():
    from netCDF4 import Dataset
    print("Finished")

test_netcdf4_python()
It looks like the issue is a side effect of the import statement; it doesn't block "Finished" from printing, but something, somewhere, has not yet returned (I infer). Interesting.
@WardF Interesting. If you remove everything but the import (and dedent it) so it just looks like:
from netCDF4 import Dataset
does that also hang? I guess this would mean something not getting garbage collected properly in the python library (threads not being killed?).
> @WardF Interesting. If you remove everything but the import (and dedent it) so it just looks like "from netCDF4 import Dataset", does that also hang? I guess this would mean something not getting garbage collected properly in the python library (threads not being killed?).
It does, in fact, still hang.
Your supposition sounds right to me; I've just started digging into the AWS SDK documentation looking to see if there is something subtle happening that I can suss out, but if it's Python garbage collection, is that something we are able to do anything about?
I'll keep digging, and will see if I can replicate this in C, but the fact it's triggered by import is a bit of a wrinkle.
I guess that, at this point, we should probably open an issue in netcdf4-python, or at least loop them in here. I don't think the upstream tests cover the options we just enabled.
> I'll keep digging, and will see if I can replicate this in C, but the fact it's triggered by import is a bit of a wrinkle.

It likely has nothing to do with Python. It's definitely reproducible in C/C++, since GDAL's C++ command line utilities stall on exit when opening a netCDF file.
> I'll keep digging, and will see if I can replicate this in C, but the fact it's triggered by import is a bit of a wrinkle.
> It likely has nothing to do with Python. It's definitely reproducible in C/C++, since GDAL's C++ command line utilities stall on exit when opening a netCDF file.

That's what makes it an interesting wrinkle; I'm not sure what's happening under the hood in terms of the import, but it isn't immediately obvious what, if anything, is happening here. I'll keep poking around at it in the meantime, but it's not as straightforward as I'd have hoped :)
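One thing I may try, to take Python out of the picture while still mimicking what the import does at the OS level, is to just load and unload netcdf.dll directly and see whether the process exits cleanly. This is only a sketch and may well not reproduce the hang at all; the DLL name and its location on PATH are assumptions:

// Untested sketch: load and unload netcdf.dll the way a Python "import"
// ultimately does, then exit. If the stall lives in DLL unload / static
// teardown, this *might* reproduce it without Python involved.
// "netcdf.dll" is assumed to be findable on PATH (e.g. %LIBRARY_BIN%).
#include <windows.h>
#include <cstdio>

int main() {
    HMODULE h = LoadLibraryA("netcdf.dll");
    if (h == NULL) {
        std::fprintf(stderr, "LoadLibraryA failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("netcdf.dll loaded\n");

    FreeLibrary(h);
    std::printf("netcdf.dll freed, exiting\n");
    // If the behaviour matches Python/GDAL, the stall would happen after
    // this point, during process shutdown.
    return 0;
}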
Ward- There is an experiment you can try to see if it is aws-sdk-cpp. You will need to do the following:
- use the current libnetcdf master branch.
- use the -DENABLE_S3_INTERNAL=on (cmake) or the --enable-s3-internal (automake) option when building the netcdf library.
- Run the python program to see if it still hangs.

> Ward- There is an experiment you can try to see if it is aws-sdk-cpp. You will need to do the following: [...]
Thanks for the tip; I'll give it a try when I replicate the python script in C. Or, if I get to the point where I need to figure out how to get Python to invoke the netcdf.dll I'm compiling :). Thanks!
I'm able to recreate a hang now in the netCDF-C tests. Interesting; it gives me a good place to start.
> I'm able to recreate a hang now in the netCDF-C tests. Interesting; it gives me a good place to start.

Thanks @WardF for digging into this! It is not an easy bit of debugging, for sure! (Hoping for an easy fix though ;-p)
@DennisHeimbigner I'm going to open an issue over on the Unidata/netcdf-c page. I'm seeing a hang in the first test run (tst_create_files) that turns into a thrown exception when running under Visual Studio. Screenshot here for reference, but it appears there is an issue with what's being passed to Aws::InitAPI(ncs3options). It is strange that this is only manifesting under Windows, and I haven't stepped through the entire test yet, but I wanted to make a note of things while I had it in front of me.
This takes Python out of the equation (much to my relief; I'm much more at home with C/C++ XD). Thanks all for your help and patience, we'll get this sorted out.
> I'm able to recreate a hang now in the netCDF-C tests. Interesting; it gives me a good place to start.
> Thanks @WardF for digging into this! It is not an easy bit of debugging, for sure! (Hoping for an easy fix though ;-p)
Hoping it's one of those cases like the other 99%, where the difficulty is in diagnosing but the fix is straightforward. I have to shift context for a little bit, but I immediately see that a NULL pointer is being passed as part of ncs3options. I have no reason to think that this is the issue, other than that NULL pointers are a great place to start looking when you are seeing access violations. But at least we've wrestled this into a debugging environment I'm able to wring some information out of. Thanks for your patience!
Hi folks, I'm wondering at what point we decide that an easy fix is not in the cards and roll back the changes so conda-forge can move on with various migrations that are stuck because of this issue. Is there any sign that a fix is imminent?
I certainly don't mean to imply that I don't appreciate the work being done to debug this issue. Far from it! But I appreciate that it can be hard to find these issues and that we all have other demands on our time as well.
> I see that a NULL pointer is being passed as part of ncs3options
Where are you seeing this?
It should not happen. The declaration of ncs3options is:
> static Aws::SDKOptions ncs3options;
which is a static object (i.e. not a pointer).
But if it is, then this could be the problem.
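For anyone following along, the documented aws-sdk-cpp init/shutdown lifecycle looks roughly like the sketch below. This is a simplified illustration of the general pattern, not the actual netcdf-c code; netcdf wraps these calls around the static ncs3options object discussed above.

// Simplified illustration of the documented aws-sdk-cpp lifecycle, not
// the actual netcdf-c code. The SDK requires InitAPI before any use and
// ShutdownAPI before its global state is torn down.
#include <aws/core/Aws.h>

static Aws::SDKOptions ncs3options;   // a static object, not a pointer

int main() {
    Aws::InitAPI(ncs3options);        // start up the SDK's global state
    // ... S3/NCZarr work would happen here ...
    Aws::ShutdownAPI(ncs3options);    // tear it down again; skipping this,
                                      // or running it during DLL unload /
                                      // static destruction, is a commonly
                                      // reported source of shutdown hangs
    return 0;
}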
@xylar IMO, it's perfectly reasonable to revert and move on until this is properly fixed upstream. But I'll leave it to those who are willing to do the work of reverting (and possibly un-reverting...).
Hello all. Having popped back up the stack, I believe we may have a fix for this at the underlying C library. This was an interesting issue to try to diagnose and fix. As usual, the fix on our end (for the short term) was far easier than the effort put into figuring out what was going on in the first place.
I'm not sure immediately how to test our fix against the libnetcdf-feedstock, but it boils down to this: on non-Linux platforms, -DENABLE_S3=TRUE needs to be paired with -DENABLE_S3_INTERNAL=TRUE. This is true for macOS as well as Windows. For whatever reason, we are observing hangs when linking against the AWS C++ SDK on these platforms. I've been able to confirm, at least, that these issues are not specific to netCDF. But they are also not universal. Maddening, and we will get it sorted out. In the meantime, @DennisHeimbigner showed foresight in his implementation of an internal S3 interface layer.
Other PRs and issues have cropped up while I was focused on this, but I will be triaging and patching ASAP so that we might get a v4.9.3 release out.
Thanks @WardF, that's exciting :). I am trying it out in #185. Just to confirm, that means we should have -DENABLE_S3_SDK=FALSE as well, right?
Hm. Unfortunately, it does not seem to work.
Solution to issue cannot be found in the documentation.
Issue
The CI in GitHub Actions for the Satpy library has started hanging on Windows. It seems to be related to the specific build of libnetcdf. With the older build (nompi_h5902ca5_107) it works fine. With the same code and same dependencies except for the newer build of libnetcdf (nompi_h624ddae_109), pytest finishes running Satpy's tests (successfully) but then never exits. There is no other difference from what I can tell except the libnetcdf build. A fellow contributor @simonrp84 was able to reproduce this on his local Windows machine. His environment is what's providing the output for the conda commands below.
Otherwise, here is a passing Satpy CI job:
https://github.com/pytroll/satpy/actions/runs/5577565489/jobs/10190534313
And a hanging one:
https://github.com/pytroll/satpy/actions/runs/5594450997/jobs/10229326506
Installed packages
Environment info