Closed slhowardESR closed 2 years ago
@slhowardESR I've not seen this before. I wonder if the client is spawning too many threads and making too many concurrent connections. Can you send me the code you are using? If you want, you can send it to me via slack so the code is kept private. I will run things on my side and see if I can recreate and diagnose the problem.
@slhowardESR after some investigation, there appear to be a number of different things happening on your processing runs:
The `ResourceWarning: unclosed socket <zmq...` messages are likely not a problem. They are warnings that the underlying code is not closing a socket it is no longer using; in this case they come from ZeroMQ, which is most likely pulled in by your Spyder environment.
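If you want to see exactly where those sockets are being allocated, the warning itself points at standard Python tooling. This is generic Python diagnostics, not a SlideRule API, and would go at the very top of your script:

```python
# Generic Python diagnostics (not part of SlideRule): start tracemalloc so
# ResourceWarning messages include the allocation traceback, and make sure
# ResourceWarnings are always shown rather than filtered.
import tracemalloc
import warnings

tracemalloc.start()
warnings.simplefilter("always", ResourceWarning)
```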
The `KeyError` crash is due to a bug in the SlideRule Python client that is triggered by a race condition in the code. I fixed the bug and will be creating a new release of the code today. You can do a `git pull` on `main` or wait for the conda update.
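For context, the crash happens when the client decrements a per-server pending count after that server's entry has already been removed from the table. The sketch below only illustrates the kind of guard involved; the lock and helper names are hypothetical and this is not the actual patch:

```python
import threading

# Illustration only; the names mirror the traceback, but this is not the real fix.
server_table = {}                     # maps server URL -> {"pending": count, ...}
server_table_lock = threading.Lock()  # hypothetical lock guarding the table

def clear_pending(serv):
    """Decrement a server's pending count, tolerating an entry that another
    thread has already removed (e.g., after the node went down)."""
    with server_table_lock:
        entry = server_table.get(serv)
        if entry is not None:
            entry["pending"] -= 1
```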
The underlying problem, though, is that server nodes are going down because they run out of memory while processing your requests. From my testing, it looks like there are a few granules (maybe three or more) in your test runs that take so much memory to process that the backend nodes handling them go down. There is a short-term fix for this that will work most of the time, but the real long-term fix is to rework the backend server code to be more memory efficient.
The short-term fix is to add `sliderule.set_max_pending(1)` right after your call to `icesat2.init(...)`; you will also have to add `from sliderule import sliderule` to your imports at the top. This call makes each backend server node process only one granule at a time; the default is three. Since YAPC uses almost 100% of the CPU, processing one granule at a time did not seem to affect the overall processing time much. The above snapshot shows a run with max pending set to 3, and then again with max pending set to 1. When set to 1, no servers bounce, though some of the memory dips get pretty low.
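Putting those pieces together, the setup block looks roughly like this (the server URL and other `init` arguments here are placeholders; keep whatever you already pass to `icesat2.init`):

```python
from sliderule import sliderule
from sliderule import icesat2

# Placeholder URL/arguments; keep your existing icesat2.init(...) call as-is.
icesat2.init("icesat2sliderule.org", verbose=False)

# Short-term workaround: have each backend server node process only one
# granule at a time (the default is three).
sliderule.set_max_pending(1)
```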
The temporary fix of setting the max pending to 1 seems to have worked well. All future development on this issue will be tracked under ICESat2-SlideRule/sliderule#117
Hi JP,
I am doing some testing in preparation for the large run, and sometimes I get this problem, which kills the program.
```
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x19391b4c280>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x19391b4c400>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x195403fc640>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x19604735dc0>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File D:\Jupyter\sliderulework\Spyder_SR\SR_by_RGT_5files_print_status.py:108 in <module>
    main()
  File D:\Jupyter\sliderulework\Spyder_SR\SR_by_RGT_5files_print_status.py:81 in main
    gdf = icesat2.atl06p(parmsyp, version=args.release,
  File d:\jupyter\sliderule-python\sliderule\icesat2.py:881 in atl06p
    return __parallelize(callback, __atl06, parm, resources, asset)
  File d:\jupyter\sliderule-python\sliderule\icesat2.py:597 in __parallelize
    result, resource = future.result()
  File ~\anaconda3\envs\sliderule\lib\concurrent\futures\_base.py:437 in result
    return self.__get_result()
  File ~\anaconda3\envs\sliderule\lib\concurrent\futures\_base.py:389 in __get_result
    raise self._exception
  File ~\anaconda3\envs\sliderule\lib\concurrent\futures\thread.py:57 in run
    result = self.fn(*self.args, **self.kwargs)
  File d:\jupyter\sliderule-python\sliderule\icesat2.py:459 in __atl06
    rsps = sliderule.source("atl06", rqst, stream=True)
  File d:\jupyter\sliderule-python\sliderule\sliderule.py:425 in source
    __clrserv(serv, stream)
  File d:\jupyter\sliderule-python\sliderule\sliderule.py:184 in __clrserv
    server_table[serv]["pending"] -= 1
KeyError: 'http://34.212.131.26'
```
I am not sure what is causing this. I can send you my code. I am basically trying to do SlideRule YAPC processing for individual RGTs in regions 10 and 12. I am running the two regions as separate processes at the same time. Sometimes it works, and sometimes it crashes.
Let me know if you need more info.
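(For reference, a minimal sketch of the kind of per-RGT YAPC request I am making is below; the polygon, RGT number, release, and YAPC settings are placeholders rather than my actual script.)

```python
from sliderule import icesat2

icesat2.init("icesat2sliderule.org", verbose=False)  # placeholder server URL

# Placeholder region polygon (list of lon/lat points); the real runs use the
# full region 10 and region 12 polygons, each driven by a separate process.
region = [
    {"lon": -150.0, "lat": 60.0},
    {"lon": -149.0, "lat": 60.0},
    {"lon": -149.0, "lat": 61.0},
    {"lon": -150.0, "lat": 61.0},
    {"lon": -150.0, "lat": 60.0},
]

# Placeholder request parameters; the "yapc" block enables YAPC scoring, and
# the values shown are illustrative, not production settings.
parms_yapc = {
    "poly": region,
    "rgt": 1234,                     # one reference ground track per request
    "srt": icesat2.SRT_LAND_ICE,
    "cnf": 0,
    "yapc": {"score": 0, "knn": 0},
    "len": 40.0,
    "res": 20.0,
}

gdf = icesat2.atl06p(parms_yapc, version="005")  # placeholder release
```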