Open JonathanGregory opened 8 months ago
Thanks, Jonathan. I shall investigate ...
The aggregation was fast: ~240 seconds on my laptop, with a local copy of your data.
I've started the CFA write. Let's see how that goes: it's 17 minutes in already, and still going ... The output file is growing at 250 kB/minute, which seems quite slow to me, so I'll dig deeper into this.
However, at only 4 minutes to aggregate on the fly ...
Here are my aggregate/write times:
In [19]: %time f = cf.read('*.pp')
CPU times: user 3min 45s, sys: 2.36 s, total: 3min 47s
Wall time: 3min 54s
In [20]: len(f)
In [21]: %time cf.write(f, 'delme.nca', cfa=True)
CPU times: user 1h 37min 15s, sys: 56.8 s, total: 1h 38min 12s
Wall time: 1h 39min 22s
In [22]: !du -sh delme.nca
25M delme.nca
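As a rough cross-check of the figures above (a sketch, not part of the original session): the 250 kB/minute growth rate seen early in the write is consistent with the final size and wall time. This assumes the 25M reported by du is about 25,000 kB and that the write rate was roughly constant.

```python
# Cross-check the early write rate against the final file size and wall time.
# Assumptions: "25M" from du is treated as 25,000 kB; the rate is constant.
final_size_kb = 25_000
wall_time_min = 1 * 60 + 39 + 22 / 60  # wall time of 1h 39min 22s, in minutes

# Average rate over the whole write, in kB per minute
average_rate = final_size_kb / wall_time_min

# Time the write would take at the 250 kB/minute observed early on
predicted_minutes = final_size_kb / 250

print(f"average rate: {average_rate:.0f} kB/minute")
print(f"predicted write time at 250 kB/minute: {predicted_minutes:.0f} minutes")
```

So the write appears to have proceeded at a steady rate throughout, rather than slowing down or stalling at one point.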
Dear @davidhassell
Thanks for the tests. Four minutes is quick for aggregation. That is an impressive speedup, indeed. However, it's too long to wait for accessing a dataset when doing interactive analysis. If you could speed it up by another factor of 100, it would be fine. :smile:
My test on racc-login-2 is still running. After nearly a day, it's written 1.9 Mbyte. Presuming it's trying to produce the same 25 Mbyte file as your test did, it will take more than three weeks to complete, which is too long to wait even for a batch job. Do you understand how it can take three weeks, or even 100 minutes, to write a netCDF file of 25 Mbyte? I haven't seen it yet, but I guess it probably contains a few hundred fields, doesn't it, metadata only of course.
Making the pph file for this directory takes about 10 minutes. This is simply a concatenation of the pp headers produced by reading all the pp files. du -sh pph gives 1.5M of actual disk space, du -sh --apparent-size pph gives 11M, which is what you'd expect for 42,000 headers of 256 bytes each plus block control words. Presumably it gets compressed by the file system owing to zeros and repetition. How can the CFA file be more than twice as big as the pph file? The aggregation should have made it much smaller, shouldn't it?
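The 11M apparent size can be checked with a quick calculation. This is only a sketch: the 8 bytes of block control words per header (one 4-byte word before and one after each record) is an assumption about the record layout, not something stated above.

```python
# Expected apparent size of the pph file: one 256-byte header per pp field,
# each wrapped in block control words.
n_headers = 42_000
header_bytes = 256
control_bytes = 2 * 4  # assumption: a 4-byte control word before and after each record

headers_only = n_headers * header_bytes
with_control_words = n_headers * (header_bytes + control_bytes)

print(f"headers only:       {headers_only / 1e6:.1f} MB")
print(f"with control words: {with_control_words / 1e6:.1f} MB")
```

Under that assumption the total comes to about 11 MB, in line with the apparent size reported by du.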
Best wishes
Some more information. I can ncdump the CFA4 file which is being generated, and I find it has so far produced 2800 fields. A file of 25M would therefore contain 37,000 fields, which is quite similar to 42,000. This seems to suggest that it's not aggregating at all, and producing one output CF field for each input pp field. There are 210 2D pp fields in each pp file, and 68 distinct stashcodes, so I think that after aggregation we should have 68 CF fields.
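The 37,000 estimate follows from scaling the field count by file size. A minimal sketch, assuming the file size grows linearly with the number of fields written and that the 2800 fields correspond to the 1.9 Mbyte reported earlier:

```python
# Extrapolate the number of fields in the finished file from the partial file.
# Assumptions: size scales linearly with field count; 2800 fields ~ 1.9 Mbyte.
fields_so_far = 2800
size_so_far_mb = 1.9
final_size_mb = 25

estimated_fields = fields_so_far * final_size_mb / size_so_far_mb

n_input_fields = 42_000   # pp fields in the directory
n_expected = 68           # distinct stashcodes, so the expected aggregated count

print(f"estimated fields in a 25 Mbyte file: {estimated_fields:.0f}")
print(f"fraction of input fields: {estimated_fields / n_input_fields:.2f}")
print(f"ratio to expected aggregated count: {estimated_fields / n_expected:.0f}x")
```

An estimate this close to the number of input fields, and hundreds of times larger than the number of stashcodes, is what would be expected if little or no aggregation were taking place.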
... We've just discussed this. Your experiment shows that it's not aggregating the specific humidity fields. That would explain why there are so many output fields. It does aggregate all the others, you say, but yet it still takes 30 minutes to write the 67 (I suppose) aggregated fields to the CFA file, without data.
has finished! It took only about 2 days, not 3 weeks, probably because most of the fields were aggregated, as you found yesterday. In the end there are 4072 fields in the file, which can be explained as 4000 for non-aggregated specific humidity, and 72 for the aggregated quantities. The file is 4.7 Mbyte actual disk space, 26.7 Mbyte apparent disk space, probably the same as yours. As we discussed yesterday, it's another question why the file took 1.5 h to write on your laptop, but 2 days on RACC, which is not generally slow for writing netCDF. But perhaps we will understand this soon.
Part of this is addressed by #737 (ensuring we write 71 fields as intended, rather than 4069!), but that is not the whole story. Tests are ongoing, and I'll write up the answer soon.
Dear @davidhassell and @sadielbartholomew
A few months ago I recall David reporting much faster time for
files. I've installed the latest version of cf-python and dependencies, I believe. In
on the RACC I am executing cfa -f CFA4 -o nca *.pp. The directory contains 42,000 pp files, each containing one pp field. So far, it has been executing for a couple of hours. Should it take this long?
Best wishes and thanks