ecmwf / metview-python

Python interface to Metview meteorological workstation and batch system
https://metview.readthedocs.io/en/latest/
Apache License 2.0
127 stars 32 forks

Avoid keeping calculated values in memory #49

Open snow73 opened 1 year ago

snow73 commented 1 year ago

Hello,

I am using metview-python to work with large amounts of GRIB data, and a lot of memory is allocated during processing (even though I eventually extract only some grid points). When inspecting the code it seems that, by default, input data is not kept in memory, but calculated values apparently are. Would it be possible to find a better way of handling memory, even if performance suffers (e.g. storing results in temporary files, or invalidating/removing values from memory once they are no longer needed)?

Alternatively, I tried to parallelize the jobs using Python multiprocessing, but I can only use the "fork" start method, which retains the memory usage; with "spawn" the child process loses the connection to Metview.

Thanks Dennis

iainrussell commented 1 year ago

Hello,

Do you know whether you are using 'pure Python' Metview, or do you have the binaries installed? Unless you have the environment variable METVIEW_PYTHON_ONLY set, you will be using the binaries rather than the code in metviewpy, and the memory management is different there (handled by the binaries). In general, we write results to disk when using the binaries, so it would be interesting to see what sort of calculations you are doing.
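A quick way to check from Python which mode is active (a sketch; it only inspects the environment variable, nothing else):

```python
import os

# Sketch: backend selection depends only on whether
# METVIEW_PYTHON_ONLY is set in the environment
def metview_backend():
    if os.environ.get("METVIEW_PYTHON_ONLY"):
        return "pure Python (metviewpy)"
    return "binaries"

print(metview_backend())
```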

Thanks, Iain

snow73 commented 1 year ago

Hi Iain,

thanks for your feedback. I use the Metview bundle and compile it with the Metview UI switched off, since I don't need it for my use case. Afterwards I install metview-python via pip, all in a Docker environment.

I just tried setting the METVIEW_PYTHON_ONLY environment variable and tested again. This did not change the behaviour. If I use "fork", or just let a single process iterate over the tasks, the memory usage grows. If I use "spawn", metview-python raises an exception:

Exception: Command "metview" did not respond within 8 seconds. This timeout is configurable by setting environment variable METVIEW_PYTHON_START_TIMEOUT in seconds. At least Metview 5 is required, so please ensure it is in your PATH, as earlier versions will not work with the Python interface.

I don't do special calculations I would say, just some unit conversion, calculating wind speed / direction from u/v and calculating the maximum/minimum/difference of two fields. Please see some examples below:

def kelvinToCelsius(data):
    return data - 273.15

def percentToOcta(data):
    return data * 8.0

def pascalToHectopascal(data):
    return data / 100.0

def meterToMillimeter(data):
    return data * 1000.0

def u10v10ToFF(data):
    u = mv.read(data=data, param="10u")
    v = mv.read(data=data, param="10v")
    ff = mv.sqrt(u * u + v * v)
    return ff

def u10v10ToDD(data):
    u = mv.read(data=data, param="10u")
    v = mv.read(data=data, param="10v")
    dd = mv.direction(u, v)
    return dd

def calculateLiquidPrecipitationEcmwf(data):
    tp = data.select(centre=model["CENTRE"], shortName="tp")
    sf = data.select(centre=model["CENTRE"], shortName="sf")
    fzra = data.select(centre=model["CENTRE"], shortName="fzra")

    rrrlq = tp - sf - fzra
    return rrrlq * 1000.0

Best regards Dennis

sandorkertesz commented 1 year ago

Hi Dennis,

Thank you for providing us with the sample code.

I wonder which Metview version you are using. You can get it with this code:

print(mv.version_info())

There were significant memory usage improvements in Metview last year, so if your version is too old that could explain your problem. However, judging from the code above, the excessive memory usage is probably coming from the code performing the looping. Would it be possible to share some code showing the main iteration through your input data?

As Iain pointed out, the Python code that does not release the fieldset memory is only used when METVIEW_PYTHON_ONLY is set. It is an experimental feature, and if you build Metview from a bundle you are most probably not using it. However, to be on the safe side, please try the following:

  - make sure METVIEW_PYTHON_ONLY is unset
  - run mv.gradient(f) for a single field

If it works, we can be sure that you are not using the pure Python implementation (since gradient() is not available there). In that case we just need to focus on the binary (C++) version to find out why memory accumulates for you.
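For example, something like this (a sketch; "test.grib" stands for any GRIB file you have at hand):

```python
import metview as mv

# read any single field from a local GRIB file (hypothetical file name)
f = mv.read("test.grib")[0]

try:
    mv.gradient(f)  # gradient() exists only in the binary (C++) version
    print("binary backend in use")
except Exception:
    print("pure Python backend in use")
```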

Kind regards, Sandor

snow73 commented 1 year ago

Hi Sandor,

sorry it took a while, but I have now prepared a bit of sample code. In the meantime I also upgraded the Metview bundle, but that did not solve the issue.

The output from mv.version_info() is: {'metview_version': 51900.0, 'metview_major': 5.0, 'metview_minor': 19.0, 'metview_revision': 0.0, 'metview_dir': '/usr/local/lib/metview-bundle', 'eccodes_version': 23000.0, 'mars_version': 20230518.0, 'mir_version': 11605.0, 'metview_python_version': '1.14.0'}

mv.gradient(f) did not cause an error, so METVIEW_PYTHON_ONLY is not set.

I added a few examples to my repository at https://github.com/meteoiq/metview. /examples/test_version.py shows the version output and tests whether mv.gradient(f) works.

/examples/test_regrid.py in the repository is a trimmed-down version of one of the use cases that results in the memory issues. In the example I download some GRIB fields from the open data server and then parallelise some calculations and regridding to a new grid. I print the memory usage of the child processes, and it shows the increasing memory consumption:

start processing msl: pid: 27 memory: 91,467,776
start processing tp: pid: 26 memory: 91,488,256
end processing msl: pid: 27 memory: 97,910,784
start processing 10u: pid: 27 memory: 97,910,784
end processing tp: pid: 26 memory: 107,147,264
start processing 10v: pid: 26 memory: 107,237,376
end processing 10u: pid: 27 memory: 98,099,200
start processing 2t: pid: 27 memory: 98,099,200
end processing 10v: pid: 26 memory: 107,986,944
end processing 2t: pid: 27 memory: 98,299,904 

In production I work with much higher resolution and many more parameters so that the memory usage adds up to 40 GB. But I hope the small example outlines the issue.

I also added an option to switch to the "spawn" method for creating new processes, which results in an error:

Traceback (most recent call last):
   File "<string>", line 1, in <module>
   File "/usr/local/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
     exitcode = _main(fd, parent_sentinel)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
     prepare(preparation_data)
   File "/usr/local/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
     _fixup_main_from_path(data['init_main_from_path'])
   File "/usr/local/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
     main_content = runpy.run_path(main_path,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "<frozen runpy>", line 291, in run_path
   File "<frozen runpy>", line 98, in _run_module_code
   File "<frozen runpy>", line 88, in _run_code
   File "/examples/test_regrid.py", line 1, in <module>
     import metview as mv
   File "/usr/local/lib/python3.11/site-packages/metview/__init__.py", line 44, in <module>
     raise exp
   File "/usr/local/lib/python3.11/site-packages/metview/__init__.py", line 28, in <module>
     from . import bindings as _bindings
   File "/usr/local/lib/python3.11/site-packages/metview/bindings.py", line 196, in <module>
     mi = MetviewInvoker()
          ^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.11/site-packages/metview/bindings.py", line 124, in __init__
     raise Exception(
 Exception: Command "metview" did not respond within 8 seconds. This timeout is configurable by setting environment variable METVIEW_PYTHON_START_TIMEOUT in seconds. At least Metview 5 is required, so please ensure it is in your PATH, as earlier versions will not work with the Python interface.

I hope this provides sufficient information to further review this issue. If you require any more information, please reach out to me.

Thank you very much for your support, Dennis

sandorkertesz commented 1 year ago

Hi Dennis, thank you for the elaborate example. I will look into it and try to find out why the memory accumulates. Sandor

sandorkertesz commented 1 year ago

Hi Dennis, please can you specify the grid resolution (both source and target) in your operational environment that results in the large memory consumption? I also wonder how many parallel processes you are running with regrid. Sandor

snow73 commented 1 year ago

Hi Sandor,

operationally we mainly interpolate O1280 to a regular 0.125°x0.125° grid. We intend to run 8 parallel processes, but due to the memory accumulation this exceeds the machine limits, so currently we run only 2 parallel processes.

Best regards Dennis

sandorkertesz commented 1 year ago

Hi Dennis,

Many thanks! So each process is using up ~20 GB of memory?

I fixed a memory leak related to GRIB handling in the C++ Metview code (not yet released), but unfortunately it did not fix your case. So far I have not found the reason for the memory leak using your examples. Actually, the memory accumulates very slowly; I tried to scale the problem up by repeating the processing loops without seeing any significant growth. On my Mac I even noticed that after a while the memory usage occasionally decreases!

Nevertheless, if I invoke the garbage collector explicitly by calling gc.collect() after writing the new fieldset to disk, the memory increases at a slower rate. Maybe you can make use of this in your environment. My other finding is that the memory increase is independent of the size of the GRIB fields. I wonder if you have the same experience.
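That is, something along these lines inside the processing loop (a sketch; the read/regrid/write calls mirror your example, names are illustrative):

```python
import gc
import metview as mv

def process_one(fs, param, out_path):
    # extract one parameter, regrid it and write the result to disk
    f = mv.read(data=fs, param=param)
    r = mv.regrid(data=f, grid=[0.125, 0.125])
    mv.write(out_path, r)
    # drop the Python references and force a collection so the
    # fieldset wrappers are released earlier
    del f, r
    gc.collect()
```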

Best regards, Sandor

sandorkertesz commented 1 year ago

Hi Dennis,

I have to admit that so far I had only run the tests as a single process (PARALLEL=False) and could not really reproduce a proper memory leak. However, when I used the PARALLEL=True mode the memory leak became obvious, and the more iterations I do, the more the memory consistently increases.

Unfortunately, Metview should not be used in parallel applications like that. It might work, but it is completely unsupported and untested. The recommended way is to run two or more Metview Python scripts at the same time, independently. My guess is that the memory would not accumulate in that case. So far this is the best idea I could come up with.
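In other words, instead of multiprocessing inside one interpreter, launch fully independent interpreters so each child gets its own Metview session. A sketch (the child command here is a placeholder; in practice it would be something like [sys.executable, "process_one.py", p], where process_one.py is a hypothetical per-parameter script):

```python
import subprocess
import sys

params = ["2t", "10u", "10v"]

# placeholder child command standing in for the real Metview script
procs = [
    subprocess.Popen([sys.executable, "-c", f"print('processed {p}')"])
    for p in params
]
exit_codes = [p.wait() for p in procs]
print(exit_codes)  # → [0, 0, 0] when all children succeed
```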

Best regards, Sandor

snow73 commented 1 year ago

Hi Sandor,

thank you for the feedback. I will refactor the processing so that metview-python runs in separate processes. I am just wondering if you are aware of other ways or best practices to use MIR interpolation, easily apply arithmetic operations on gridded data, and output GRIB again. My understanding was that Metview is the only possibility, as MIR is not available as a separate package.

Best regards and thanks again for your help! Dennis

sandorkertesz commented 1 year ago

Hi Dennis,

Actually, MIR is available as a separate package: it is on GitHub, and you can build it e.g. as part of the Metview bundle.

We are fully aware of the limitations of using Metview when it comes to parallel runs. There is a new project at ECMWF called earthkit that will offer similar functionalities in Python to Metview but will be fully scalable. However, it is not yet available.

As for Metview best practices, I can make the following recommendations.

  1. Metview GRIB processing is rather heavy on I/O (all results and intermediate/temporary steps are written to disk), so using a faster disk can improve performance.

  2. You can extract the GRIB data into numpy arrays, do the computations in memory, and then write the results back to GRIB. See the values() and set_values() methods on a Fieldset.

  3. There are some built-in methods, which are more efficient than raw GRIB arithmetic. E.g. use speed() instead of mv.sqrt(u*u + v*v)

  4. It is probably better to process multiple fields in one go rather than field by field. E.g. on my laptop this code (fs is a fieldset):

r = mv.regrid(
    data=fs,
    grid=[0.2, 0.2]
)

runs almost 2.5 times faster than this:

r = mv.Fieldset()
for f in fs:
    g = mv.regrid(
        data=f,
        grid=[0.2, 0.2]
    )
    r.append(g)

  5. Metview offers some parallelism when its Macro language is used, as described here: https://confluence.ecmwf.int/display/METV/Efficiency+and+use+of+multiple+processors. However, the Python interface works differently: each call to a module is synchronous. Since regrid is a module, in a Python script only one forked regrid process can run at a time.
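Points 2 and 3 combined might look like this (a sketch; the input file name and field selection are assumptions, not from your examples):

```python
import metview as mv

fs = mv.read("input.grib")  # hypothetical input file

# point 3: use the built-in speed() instead of raw GRIB arithmetic
u = mv.read(data=fs, param="10u")
v = mv.read(data=fs, param="10v")
ff = mv.speed(u, v)  # cheaper than mv.sqrt(u*u + v*v)

# point 2: pull the values into numpy, compute in memory, write back
t = mv.read(data=fs, param="2t")
t_c = t.set_values(t.values() - 273.15)  # returns a new fieldset
```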

Best regards, Sandor

snow73 commented 1 year ago

Thank you Sandor, that was very helpful. Feel free to close this issue. I am happy to test earthkit with my specific use case and give feedback once it is a bit more mature.

Best regards Dennis