ecmwf / ecpoint-calibrate

Interactive GUI (developed in Python) for calibration and conditional verification of numerical weather prediction model outputs.
GNU General Public License v3.0

Software very slow and gets stuck when loading a big ASCII file for 2m temperature #112

Closed EstiGascon closed 3 years ago

EstiGascon commented 3 years ago

The main issue is that, for precipitation, we removed all values < 1 mm, whereas for temperature we keep all the observation points. For example, 8 variables for precipitation give about 2 million samples in the ASCII file, using the 00 and 12 UTC runs. For 2m temperature, just 4 variables for the 00 UTC run already produce about 50 million lines in the ASCII file. It would be necessary to speed up the computations for this file, and ideally for double that size (100 million lines), so that at least 8 variables can be included.

FatimaPillosu commented 3 years ago

Email received from @EstiGascon


Augustin and I have been doing many tests with the temperature data, especially for the extended-range experiment, and we have found an important issue with the amount of data compared to what we previously had in the rainfall database.

If we want to use the 00 and 12 UTC runs for all the test variables (say 8 variables), the file is so huge that it is impossible to work with it. We tried using just the 00 UTC run and only 4 variables, and it is still huge (50 million points, compared with 2 million in the rainfall calibration). Even then we end up with a 4.5 GB file, whereas for rainfall the whole database was 370 MB. This is because, for precipitation, we only considered points with precipitation > 1 mm, but for temperature we are keeping all the observations around the globe.

Anirudha is going to work on speeding it up (it is already an issue in the GitHub repo); however, I think the difference with rainfall is massive, and I am not sure that in the end we will manage to process the possible 7-8 GB file size that we need much faster.

We can wait until Anirudha works on it (I have added him in cc so he is aware of the priority and can maybe add some input on how feasible it is to fix this issue), but for now it has been impossible to work with the calibration software for temperature.

Hi @onyb, I think this needs priority as @EstiGascon and @AugustinVintzileos are currently working with temperature. Can you advise them on whether there might be a possible solution to this problem? I know that, given the numbers, this may be pushing the software a bit. They are also looking for other ways to reduce the size of the ASCII point data table, but, if possible, it would be nice to solve this problem in the software.

Cheers,

Fatima

ATimHewson commented 3 years ago

This stops development of the DT for 2m temp, though rainfall is still OK, so the issue has been dropped from level 1 to level 2. It is still critical though, as @EstiGascon and @AugustinVintzileos are now working on 2m temp!

onyb commented 3 years ago

Am I right in assuming that you're only experiencing this slowness while loading an existing ASCII table, and not while generating it?

Can you also confirm that the issue is with the software's apparent inability to load very large files and that the software is not erroneously writing more lines to the ASCII table than it should?

I'm actually not surprised that it's slow. Evaluating a weather type from the decision tree requires loading the entire ASCII table into memory as a Pandas dataframe. For a file that is 7-8 GB in size, you'll definitely run out of memory. There are two solutions I can think of:

The obvious choice is the second one, since the first idea is a lot of work. It'd be nice if you could send me this huge ASCII table so I can test the correctness of my algorithm.
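
In the meantime, a quick way to gauge how much memory the full table would need, without actually loading all of it, could look like the sketch below. The file path, the whitespace separator, and the assumption of a single header line are placeholders rather than the real table layout:

```python
import pandas as pd

PATH = "point_data_table.ascii"  # hypothetical path to the big table

# Load only the first 100,000 rows; separator and header handling are assumptions.
sample = pd.read_csv(PATH, sep=r"\s+", nrows=100_000)

# Deep in-memory size of the sample, in bytes.
sample_bytes = sample.memory_usage(deep=True).sum()

# Count the total number of data lines cheaply, then extrapolate.
with open(PATH) as f:
    total_rows = sum(1 for _ in f) - 1  # minus one header line (assumption)

estimated_gb = sample_bytes / len(sample) * total_rows / 1024**3
print(f"Estimated in-memory size: {estimated_gb:.1f} GB")
```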

EstiGascon commented 3 years ago

Hi Anirudha,

Many thanks for your fast answer. Yes, the slowness problem occurs only when loading an existing ASCII table; so far, we have not had problems during the creation of the file.

I am almost sure the file is so large because, for precipitation, we removed a lot of points due to the threshold we apply of only selecting points with precipitation > 1 mm, which means most of the observation points are discarded. For 2m temperature, we do not remove any observations, so we use every single point from the observation database, and that creates a huge file.

I understand that reading it with Pandas won't be easy; I tried to read the file this morning with Pandas and with csv.reader as well, but my computer could not handle it ("MemoryError").

Regarding your solutions: yesterday Tim proposed your second solution of splitting the file in a "coherent" way into different ASCII files. We thought about splitting the ASCII file according to the "Local solar time" variable (one file for each range of local solar times: from 0 to 1, from 1 to 2, and so on), giving 24 ASCII files in total. I am currently finishing a bash script that does this outside the software; I will send you the code once I finish it tomorrow, and you can adapt the methodology to Python.
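
For reference, a rough Python equivalent of that splitting step could look like the sketch below. It streams the table in chunks so the full 4.5 GB never sits in memory; the column name LocalSolarTime, the whitespace separator, and the output file names are assumptions, not the actual layout of our files:

```python
import pandas as pd

SRC = "temperature_point_data.ascii"   # hypothetical input file
CHUNK_ROWS = 1_000_000                 # rows held in memory at any one time

wrote_header = [False] * 24

# Stream the big table chunk by chunk instead of loading it all at once.
for chunk in pd.read_csv(SRC, sep=r"\s+", chunksize=CHUNK_ROWS):
    # Bin local solar time into hourly ranges: 0-1, 1-2, ..., 23-24.
    hour = chunk["LocalSolarTime"].astype(float).astype(int).clip(0, 23)
    for h, part in chunk.groupby(hour):
        out = f"temperature_point_data_lst_{h:02d}.ascii"
        part.to_csv(out, sep=" ", index=False, mode="a",
                    header=not wrote_header[h])
        wrote_header[h] = True
```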

I think this is the best solution (at least for now), as the other would take you more time and is a lot of work, so we can study that option in the future. But please, @FatimaPillosu, @ATimHewson, add any comments if you see other options. Augustin also agrees with me about the file-splitting solution.

ATimHewson commented 3 years ago

Hi All,

Many thanks Anirudha and Esti for your input! Fatima and I have also discussed this issue this evening. We agree that Anirudha's option 2 is essentially the way to go. But if we analyse the problem a bit more closely and try to think of a universal way to rectify the issue, one that will work in other scenarios, I think we will end up with a better solution for all. Though of course we understand there is some urgency on Esti's side to get things fixed!

At the top level of any decision tree, if we have, say, millions of cases, there is really no point in analysing all of them for the first governing variable. All we need in order to decide on reliable breakpoints there is a smaller subset, e.g. 1-10% of all cases, although they must be randomly selected in case the point data table happens to be ordered in some meaningful way (which we don't want). The selected percentage should of course be such that the resulting dataset size is manageable.
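
For illustration, such a random subset could be drawn chunk by chunk with pandas so that the full table never has to be held in memory at once. The file name, chunk size, 5% fraction, and fixed seed below are placeholder choices, not prescribed values:

```python
import pandas as pd

FRACTION = 0.05  # e.g. 5% of all cases
pieces = []

# Sample each chunk independently; the fixed seed makes the subset reproducible,
# and random sampling guards against any accidental ordering of the table.
for chunk in pd.read_csv("point_data_table.ascii", sep=r"\s+",
                         chunksize=1_000_000):
    pieces.append(chunk.sample(frac=FRACTION, random_state=42))

subset = pd.concat(pieces, ignore_index=True)
print(f"Rows available for the level 1 analysis: {len(subset)}")
```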

Then, with the reduced-size dataset, we proceed in the usual way to create level 1 breakpoints, with K-S tests etc. Local solar time is a bit unusual in that we know a priori what breakpoints we want to use, as Esti says, but they can still be entered in the usual way.

After that we could treat each level 1 branch as a decision tree of its own, loading in and analysing the full data for each such branch (not subsampled this time) sequentially. This is much the same as we do now, except for needing to:

  1. load the full point data for each branch
  2. save each branch's own decision tree definition and mapping functions separately, ensuring level 1 codes are included
  3. unload a branch's point data once it has been dealt with
  4. at the end, merge the results of step 2 above for all branches to deliver the final decision tree definition

I would like to think that this is a reasonably watertight strategy for any situation (though maybe not if you have an extraordinary amount of data such that subsetting is needed also below level 1!). However I have no idea how technically challenging it would be.
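
To make steps 1-4 a little more concrete, a minimal sketch of the per-branch loop might look as follows, assuming the 24 per-hour files discussed above and with grow_branch_tree and merge_trees as hypothetical stand-ins for the existing breakpoint and mapping-function machinery:

```python
import pandas as pd

def grow_branch_tree(df, level1_code):
    # Stand-in for the existing breakpoint / mapping-function logic;
    # here it only records the branch code and the sample count.
    return {"level1_code": level1_code, "n_samples": len(df), "breakpoints": []}

def merge_trees(branch_trees):
    # Stand-in for step 4: combine per-branch definitions into one final tree.
    return {"branches": branch_trees}

branch_results = []
for hour in range(24):  # one level 1 branch per local solar time range
    path = f"temperature_point_data_lst_{hour:02d}.ascii"    # hypothetical file name
    branch_df = pd.read_csv(path, sep=r"\s+")                 # 1. load the full branch data
    branch_results.append(grow_branch_tree(branch_df, hour))  # 2. per-branch tree + level 1 code
    del branch_df                                              # 3. unload the branch data

final_tree = merge_trees(branch_results)                       # 4. merge into the final tree
```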

Comments are welcome (@EstiGascon @onyb @AugustinVintzileos @FatimaPillosu) but if this is not clear let's discuss further.

onyb commented 3 years ago

Thanks a lot @ATimHewson for your insights. To put it in very simple terms, your strategy amounts to tweaking the algorithm so that it loads less data.

@FatimaPillosu / @EstiGascon To start tackling the above, could you please upload to Google Drive the following:

I am looking at this performance issue from other angles too. These could be supplementary to the idea discussed above, so we can push the limits of the software even further. In particular, I have come to the conclusion that ASCII tables are a very inefficient way of storing large datasets. I have two questions in this regard.

  1. How important is it for you to store the ASCII table in a human-readable, CSV-like format?

    If you ask me, it's already impossible to open such files in a text editor, given their size (several GBs).

  2. Today we have columns such as BaseDate (e.g. 2015-06-01) and DateOBS (e.g. 20150601) in the point data table. Would it be an issue if we used the same format for both of these columns?

The reason behind these questions is that we could apply some clever compression techniques to store each unique date only once, and use space-efficient integers to indicate which specific value is used in each row. I already did some tests with a dummy ASCII table of 300 MB, which I was able to bring down to 150 MB. I can't guarantee that the outcome will be as dramatic with real data, but it's definitely a nice-to-have in addition to the larger optimizations we've been discussing.

Additionally, changing the file format to something like Parquet (maybe you know about it already) can make read operations much faster, and allow us more flexibility to organize the data in an optimized way, which is not possible today with CSVs. The above compression idea is one such example.
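
As a rough illustration of both ideas together (dictionary-style compression of the date columns plus a columnar format), a conversion could look like the sketch below. The column names follow the ones mentioned above, but the paths are placeholders:

```python
import os
import pandas as pd

CSV_PATH = "point_data_table.csv"          # hypothetical input
PARQUET_PATH = "point_data_table.parquet"  # hypothetical output

df = pd.read_csv(CSV_PATH)

# The 'category' dtype stores each unique date only once, plus small integer
# codes per row; Parquet preserves this as dictionary encoding.
for col in ("BaseDate", "DateOBS"):
    if col in df.columns:
        df[col] = df[col].astype("category")

# Requires the pyarrow package; compression defaults to snappy.
df.to_parquet(PARQUET_PATH, engine="pyarrow")

print("CSV size (MB):     ", os.path.getsize(CSV_PATH) / 1e6)
print("Parquet size (MB): ", os.path.getsize(PARQUET_PATH) / 1e6)
print("In-memory (MB):    ", df.memory_usage(deep=True).sum() / 1e6)
```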

FatimaPillosu commented 3 years ago

Hi Anirudha,

sorry, I failed to send you the information about rainfall yesterday. I promise I will send it tomorrow morning; I'm not at my work computer at the moment. It seems that you have good ideas there. I would just comment on some points.

From the point of view of analysing the data, it is important to somehow be able to inspect it. We do look at the point data tables. We probably don't need to see every single point, but it is useful to be able to open the table and check it if needed. Would Parquet allow us to do so? If not, maybe we can add a tool to the software that lets us visualise the metadata printed at the top of the point data table and a subset of the actual values.

BaseDate and DateOBS can be in the same format (e.g. yyyy-mm-dd for both), but they represent different things, so we do need to keep them both.

FatimaPillosu commented 3 years ago

Hi @onyb, I have already sent you the data you requested. It is on Google Drive. Please let me know if you don't have access to it. Cheers, Fatima

onyb commented 3 years ago

Hi @FatimaPillosu. Thanks for uploading the data.

From the point of view of analysing the data, it is important to somehow be able to inspect it. We do look at the point data tables. We probably don't need to see every single point, but it is useful to be able to open the table and check it if needed. Would Parquet allow us to do so? If not, maybe we can add a tool to the software that lets us visualise the metadata printed at the top of the point data table and a subset of the actual values.

Parquet is a binary format, similar to GRIB, so you cannot open it with a text editor and look at the contents. However, on the page where we load the point data table, we can display the metadata, and also the first few rows as a preview.
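
For instance, a preview along these lines could be built with pyarrow without reading the whole file; the path below is a placeholder and the exact GUI wiring is still to be decided:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("point_data_table.parquet")  # hypothetical path

# Metadata is available without reading the data itself.
print("Rows:   ", pf.metadata.num_rows)
print("Columns:", pf.schema_arrow.names)

# Read only the first few rows as a preview.
first_rows = next(pf.iter_batches(batch_size=5)).to_pandas()
print(first_rows)
```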

Note that the old method of using ASCII tables will still be available to the user. The user will be able to select the format he/she wishes to use.

BaseDate and DateOBS can be in the same format (e.g. yyyy-mm-dd for both), but they represent different things, so we do need to keep them both.

Yes, we'll keep both of them. It's just to prevent our point data table loader from interpreting the dates as integers.

UPDATE: I just implemented this new loader based on Parquet (see fcc7ee8bd097c4e6b7c0e3db8a2e7c358753c614), and the results are quite promising. The ASCII table Fatima sent me previously had a memory requirement of 421 MB. With the new loader, it is down to just 157 MB, an improvement of 2.7x. I think the difference will be even greater for the point data tables generated by Esti.

Note that this feature is not released yet. I'll be adding Parquet as an optional format to the GUI in the next release.

In parallel, I am investigating Tim's approach too. I'll keep everyone informed if there's any new development.