NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library
http://ncas-cms.github.io/cf-python
MIT License

speed of cfa #736

Open JonathanGregory opened 8 months ago

JonathanGregory commented 8 months ago

Dear @davidhassell and @sadielbartholomew

A few months ago, I recall, David reported a much faster time for cfa processing of pp files. I believe I have installed the latest versions of cf-python and its dependencies:

>>> cf.environment(paths=False)
Platform: Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
HDF5 library: 1.12.2
netcdf library: 4.9.3-development
udunits2 library: libudunits2.so.0
esmpy/ESMF: not available
Python: 3.9.13
dask: 2023.7.0
netCDF4: 1.6.4
psutil: 5.9.0
packaging: 21.3
numpy: 1.25.1
scipy: 1.10.0
matplotlib: 3.5.2
cftime: 1.6.2
cfunits: 3.3.6
cfplot: not available
cfdm: 1.11.1.0
cf: 3.16.1

$ cfa
Using cf-python library version 3.16.1 at /home/users/sws02jmg/.local/lib/python3.9/site-packages/cf

In /storage/basic/baobab/jonathan/general/exprzb.000100 on the RACC, I am running cfa -f CFA4 -o nca *.pp. The directory contains 42,000 pp files, each containing one pp field. So far it has been running for a couple of hours. Should it take this long?
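
If I understand correctly, this is roughly equivalent to the following library calls (just a sketch of my understanding, not necessarily what cfa actually does internally; 'nca' stands for the output name given to -o):

import cf

# Read and aggregate all of the pp fields (aggregation happens on
# the fly during the read).
fields = cf.read('*.pp')

# Write a CFA-netCDF file holding only the metadata and the
# aggregation instructions, not the data themselves.
cf.write(fields, 'nca', cfa=True)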

Best wishes and thanks

Jonathan

davidhassell commented 8 months ago

Thanks, Jonathan. I shall investigate ...

davidhassell commented 8 months ago

However, at only 4 minutes to aggregate on the fly ...

davidhassell commented 8 months ago

Here are my aggregate/write times:

In [19]: %time f = cf.read('*.pp')
CPU times: user 3min 45s, sys: 2.36 s, total: 3min 47s
Wall time: 3min 54s

In [20]: len(f)
4069

In [21]: %time cf.write(f, 'delme.nca', cfa=True)
CPU times: user 1h 37min 15s, sys: 56.8 s, total: 1h 38min 12s
Wall time: 1h 39min 22s

In [22]: !du -sh delme.nca
25M delme.nca

JonathanGregory commented 8 months ago

Dear @davidhassell

Thanks for the tests. Four minutes for the aggregation is quick; that is indeed an impressive speedup. However, it's still too long to wait when opening a dataset for interactive analysis. If you could speed it up by another factor of 100, it would be fine. :smile:
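
(To be fair, once the CFA file has been written, reading it back for analysis should presumably be quick, since it holds only metadata and aggregation instructions; something like the following, using your file name:)

import cf

# Opening the CFA-netCDF file should be fast: only the metadata and
# aggregation instructions are read, and the pp data are fetched
# lazily when actually needed.
f = cf.read('delme.nca')
print(f[0])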

My test on racc-login-2 is still running. After nearly a day, it has written 1.9 Mbyte. Presuming it's trying to produce the same 25 Mbyte file as your test did, it will take more than three weeks to complete, which is too long to wait even for a batch job. Do you understand how it can take three weeks, or even 100 minutes, to write a netCDF file of 25 Mbyte? I haven't seen it yet, but I guess it contains a few hundred fields, doesn't it, with metadata only of course.

To make the pph file for this directory takes about 10 minutes. This is simply a concatenation of the pp headers, produced by reading all the pp files. du -sh pph gives 1.5M of actual disk space, and du -sh --apparent-size pph gives 11M, which is what you'd expect for 42,000 headers of 256 bytes each plus block control words. Presumably it gets compressed by the file system owing to the zeros and repetition. How can the CFA file be more than twice as big as the pph file? The aggregation should have made it much smaller, shouldn't it?
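
(Spelling out that arithmetic, and assuming each header record is framed by two 4-byte block control words:)

# Expected apparent size of the pph file: 42,000 records, each a
# 256-byte pp header framed by two 4-byte block control words
# (the record-marker size is an assumption).
n_records = 42_000
record_size = 256 + 2 * 4
print(n_records * record_size / 1024**2)  # ~10.6 MiB, i.e. the ~11M that du reports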

Best wishes

Jonathan

JonathanGregory commented 8 months ago

Some more information. I can ncdump the CFA4 file that is being generated, and I find it has so far produced 2800 fields. A file of 25M would therefore contain about 37,000 fields, which is quite similar to 42,000. This suggests that it isn't aggregating at all, and is producing one output CF field for each input pp field. There are 210 2D pp fields in each pp file, and 68 distinct stashcodes, so I think that after aggregation we should have 68 CF fields.
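
(A quick way to cross-check this, I think, would be to compare the number of fields returned by an aggregated read with the number of distinct quantities; a rough sketch, using the field identities as a stand-in for the stashcodes:)

import cf

# Read with on-the-fly aggregation (the default) and compare the
# number of fields with the number of distinct quantities; a large
# difference would mean some quantity is failing to aggregate.
f = cf.read('*.pp')
print(len(f))
print(len(set(g.identity() for g in f)))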

... We've just discussed this. Your experiment shows that it's not aggregating the specific humidity fields, which would explain why there are so many output fields. You say it does aggregate all the others, yet it still takes 30 minutes to write the 67 (I suppose) aggregated fields to the CFA file, without data.
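
(If it would help to see why those particular fields refuse to aggregate, I believe something like the following reports the reasons, though I may have the verbosity keyword slightly wrong:)

import cf

# Read without aggregating, then aggregate explicitly at a high
# verbosity level so that the reasons for any failure to aggregate
# are printed (keyword names recalled from memory).
fields = cf.read('*.pp', aggregate=False)
aggregated = cf.aggregate(fields, verbose=3)
print(len(aggregated))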

JonathanGregory commented 8 months ago

cfa has finished! It took only about 2 days, not 3 weeks, probably because most of the fields were aggregated, as you found yesterday. In the end there are 4072 fields in the file, which can be explained as 4000 for the non-aggregated specific humidity and 72 for the aggregated quantities. The file is 4.7 Mbyte of actual disk space and 26.7 Mbyte of apparent disk space, probably the same as yours. As we discussed yesterday, it's another question why the file took 1.5 h to write on your laptop but 2 days on the RACC, which is not generally slow at writing netCDF. Perhaps we will understand this soon.

davidhassell commented 8 months ago

Part of this is addressed by #737 (ensuring that we write 71 fields as intended, rather than 4069!), but that is not the whole story. Tests are ongoing, and I'll write up the answer soon.