ecmwf / ecpoint-calibrate

Interactive GUI (developed in Python) for calibration and conditional verification of numerical weather prediction model outputs.
GNU General Public License v3.0

Deal with large amounts of data in the calibration software #116

Closed: ATimHewson closed this issue 3 years ago

ATimHewson commented 3 years ago

Hi Esti, Fatima suggested (as she proposed in a previous email) that we should deal with scientific comments like this in this section of GitHub, under the milestone "Science Discussions" and with the label "Science", so that Anirudha can filter them out while the rest of us can still read the contents if we wish. Evidently this relates directly to a technical issue, so we must link to that too; I have inserted such a link below your email, which is copied below for clarity. I am on a big learning curve here too! Tim


Esti wrote:

Hi Tim, all,

I am answering your GitHub comment in this email because Fatima insisted last week that we should not have scientific discussions inside an issue on git, so that Anirudha won't be confused. It is better to have our discussions here and, once we decide, to write our final proposal to Anirudha.

I agree that your approach of a randomly selected database could be a possible solution. One option could be to let the user define the percentage of data to keep from the final database (for example, 5%); another option could be to split the ASCII files according to ranges of some specific variable. I would suggest keeping the two methodologies separate, so the user can apply one or both. They probably won't run one automatically after the other, since they must first analyse the reduced database to select breakpoints for the first variable in the tree, and only then divide the data with the splitting option.

I also think that in many cases reducing the database to 1-10% will be enough for the user, so no further splitting is needed because the software is fast enough. In other cases, the user may just want to split the ASCII file based on ranges of some variable they are interested in (like our local solar time). So I think it would be really useful to have both tools, but separate from one another: you should not need to run one first and then be asked by the software whether you also want to apply the other; either methodology should be applicable on its own, with no dependencies between them. A rough sketch of two such independent tools is given below.
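The following is only an illustrative sketch of the two independent tools described above, assuming the point-data tables are whitespace-separated ASCII files readable with pandas; the file names and the "LocalSolarTime" column are hypothetical.

```python
import pandas as pd

def subsample_ascii(path: str, fraction: float, out_path: str, seed: int = 42) -> None:
    """Tool 1: keep a random fraction (e.g. 0.05 for 5%) of the rows."""
    df = pd.read_csv(path, sep=r"\s+")
    df.sample(frac=fraction, random_state=seed).to_csv(out_path, sep=" ", index=False)

def split_ascii_by_ranges(path: str, column: str, edges: list, out_stem: str) -> None:
    """Tool 2: write one file per interval of `column`, e.g. local solar time bins."""
    df = pd.read_csv(path, sep=r"\s+")
    for lo, hi in zip(edges[:-1], edges[1:]):
        part = df[(df[column] >= lo) & (df[column] < hi)]
        part.to_csv(f"{out_stem}_{lo}_{hi}.ascii", sep=" ", index=False)

# The tools are independent: run either one on its own, or both.
subsample_ascii("points.ascii", 0.05, "points_5pct.ascii")
split_ascii_by_ranges("points.ascii", "LocalSolarTime", [0, 6, 12, 18, 24], "points_lst")
```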

Comments are welcome!

Esti


Technical issue: #112

FatimaPillosu commented 3 years ago

Hi all,

I think the following could be a way to combine the options Esti was describing.

As we don't know how big the ASCII point-data tables may become in future, or how many predictors we might want to analyse, I would suggest that we offer the selection of a subset (a certain percentage), but also verify the size of the resulting file internally, so we are certain that the subset generates a file small enough not to create problems. I think this would give us the possibility to analyse the whole decision tree and define a first set of breakpoints for all variables. If the file is not small enough, the user is asked to enter a smaller percentage (a minimal sketch of this check follows).
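A minimal sketch of that check, reusing the hypothetical subsample_ascii helper from the earlier comment; the 200 MB threshold is purely illustrative, not a project setting.

```python
import os

MAX_BYTES = 200 * 1024 * 1024  # illustrative size limit for the reduced file

def subsample_until_small_enough(path: str, out_path: str, fraction: float) -> float:
    """Subsample, verify the output size, and ask for a smaller percentage if needed."""
    while True:
        subsample_ascii(path, fraction, out_path)
        if os.path.getsize(out_path) <= MAX_BYTES:
            return fraction
        fraction = float(input(
            f"Reduced file is still too large; enter a percentage below {fraction * 100:.0f}%: "
        )) / 100.0
```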

I think that in this way we can analyse all the predictors and the whole decision tree in the first place; let's call it a preliminary analysis. It would give us a better picture of the whole decision tree. I mention this because we are assuming here that we know which variable we want at the top of the tree: in the case of the local solar time it might be clear and intuitive, but in other cases it might not be.

On the basis of that preliminary analysis, the user can later decide to select only a branch of the decision tree (i.e. the data between two breakpoints of a specific governing variable, or even further down the tree if there is too much data) and work with all the points, but only for that subset of governing variables. In this way it might be possible to see whether the preliminary analysis is confirmed or not (a brief illustration of selecting one branch is given below).
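A hedged illustration of selecting one branch: keep every point whose governing variable falls between two breakpoints chosen in the preliminary analysis. The "CAPE" column and the breakpoint values are hypothetical.

```python
import pandas as pd

df = pd.read_csv("points.ascii", sep=r"\s+")
# Keep only the branch bounded by two breakpoints of the governing variable.
branch = df[(df["CAPE"] >= 250.0) & (df["CAPE"] < 1000.0)]
branch.to_csv("points_branch.ascii", sep=" ", index=False)
```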

One more thing needs to be considered if we go with this approach: the computational files for each branch, which are produced by the software and are then used to produce the forecasts, will somehow need to be merged automatically, so as not to introduce possible human errors. This could probably be done by saving the computational files internally and merging them at the end of the analysis (a rough sketch is given below).
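A rough sketch of that automatic merge step, assuming the per-branch computational files are whitespace-separated tables with identical columns; the "branch_*.ascii" naming pattern is purely illustrative.

```python
import glob
import pandas as pd

def merge_branch_files(pattern: str, out_path: str) -> None:
    """Concatenate all per-branch computational files into a single table."""
    parts = [pd.read_csv(p, sep=r"\s+") for p in sorted(glob.glob(pattern))]
    merged = pd.concat(parts, ignore_index=True).drop_duplicates()
    merged.to_csv(out_path, sep=" ", index=False)

merge_branch_files("branch_*.ascii", "computational_merged.ascii")
```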

What do you think?

Cheers,

Fatima

onyb commented 3 years ago

The combination of the Parquet format and Cheaper mode should resolve this. Can we close the issue?
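For context, a minimal illustration of why Parquet helps, assuming pandas with pyarrow installed; the file and column names are hypothetical, and Cheaper mode itself is an ecPoint-Calibrate feature not reproduced here.

```python
import pandas as pd

# One-off conversion of the ASCII point-data table to Parquet.
pd.read_csv("points.ascii", sep=r"\s+").to_parquet("points.parquet")

# Parquet is columnar, so later reads can load only the predictors they need,
# keeping memory use far below that of re-reading the whole ASCII table.
subset = pd.read_parquet("points.parquet", columns=["LocalSolarTime", "CAPE"])
```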

FatimaPillosu commented 3 years ago

yes