benmsanderson / METEOR

Multi-timescale pattern scaling
Apache License 2.0

Cmip6MeteorDataGetter might pull training data inconsistently #22

Open normansteinert opened 1 day ago

normansteinert commented 1 day ago

The data getter may pull CMIP data inconsistently. My suspicion is that this is related to ESGF server availability.

I pulled data on different days and got a slightly different set of working models each time. In this particular case it was some piControl data that worked inconsistently, but scenario data might be affected too. My guess is that this is model-specific.

Ideally, you fetch the data once and keep all training data stored locally in 'cache'. But the result might depend on when this is done, which is a problem for reproducibility and for the plot scripts working universally.

Code: data_getter = Cmip6MeteorDataGetter(exps=["piControl", "abrupt-4xCO2", "historical", "ssp245"], flds=flds, dbe=["CMIP", "CMIP", "CMIP", "ScenarioMIP"])

This is an example error message from 'make_meteor_training_data' for models that don't work:

[Image: Screenshot 2024-11-22 at 09 31 51]

benmsanderson commented 1 day ago

Ignore my prior message: it's not ESGF, we're pulling this data from the Google Cloud mirror. But it looks like a specific issue with the fields or dimensions being inconsistent between two experiments, or something similar. It's difficult to be sure without seeing the full output.
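One way to narrow this down without the full traceback would be to compare the dimension sizes of the same model across two experiments. The sketch below is only an illustration: `diff_dim_sizes` is a hypothetical helper, not part of METEOR, and it assumes the sizes are available as plain name-to-length mappings (for example from xarray's `Dataset.sizes`, if the data are loaded with xarray, which this thread does not confirm). The time axis is ignored because experiments legitimately differ in length.

```python
from typing import Dict, Mapping, Optional, Sequence, Tuple


def diff_dim_sizes(
    sizes_a: Mapping[str, int],
    sizes_b: Mapping[str, int],
    ignore: Sequence[str] = ("time",),
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return the dimensions whose sizes differ between two experiments.

    A dimension missing from one side is reported with None on that side.
    Hypothetical helper for illustration only.
    """
    diffs: Dict[str, Tuple[Optional[int], Optional[int]]] = {}
    for dim in sorted(set(sizes_a) | set(sizes_b)):
        if dim in ignore:
            continue  # experiments are expected to differ along these axes
        size_a = sizes_a.get(dim)
        size_b = sizes_b.get(dim)
        if size_a != size_b:
            diffs[dim] = (size_a, size_b)
    return diffs
```

An empty result means the two experiments agree on every non-ignored dimension; anything else points at the dimension to inspect first.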

maritsandstad commented 1 day ago

@normansteinert: have you tested which models have this issue? This could be a reason to bump the built-in caching up the priority list, possibly including a "check-availability-for-models-not-in-cache" option. That would give you a result that isn't necessarily always the same, but is always at least as comprehensive as what you had before. Also, is it just the datagetter definition line that throws this error? It's not, right? It's the subsequent looping over models? How do you get the models to loop over? The datagetter should in principle know which models have available data, but if you rerun only parts of a notebook that may be a problem...
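The "progressively more or equally comprehensive" behaviour can be sketched as a cache that only ever grows. This is a minimal illustration, not the existing caching code: `update_cached_models` and the JSON file layout are assumptions, and it tracks only the list of model names, not the data itself.

```python
import json
from pathlib import Path
from typing import Iterable, List


def update_cached_models(cache_file: Path, available_now: Iterable[str]) -> List[str]:
    """Merge the models available right now into a persistent list.

    Models already in the cache are kept even if they are unavailable
    today, so the returned list never shrinks between runs.
    Hypothetical helper for illustration only.
    """
    cached = json.loads(cache_file.read_text()) if cache_file.exists() else []
    merged = sorted(set(cached) | set(available_now))
    cache_file.write_text(json.dumps(merged, indent=2))
    return merged
```

Each run returns a superset of the previous one, so a script written against an earlier list keeps working, although two runs on different days can still differ.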

normansteinert commented 1 day ago

Yes, that's correct. The data getter works fine and just gets whatever it finds. It seems to be just a handful of models that do not work, so you still end up with a good 35 or so models. The error then comes from creating the training data (with make_meteor_training_data). So far I have taken out the models that didn't work manually. Having these models or not does not make a big difference for the multi-model results. But it may require reworking the list of models that work or don't work, depending on when you run the script and do the caching. I think it would be nice to have functionality that checks the data before it is turned into METEOR training data, so you don't have to worry about error messages when running the script, even though the exact list of models might vary slightly. I also think that this is slightly separate from the caching, as this is just a means to run the script with loads of models. And I already have the caching in my script; it is also in one of Ben's example notebooks.
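A pre-check of this kind could be as simple as trying each model and recording the failures instead of stopping at the first one. The sketch below assumes the training-data step can be wrapped in a callable that raises on bad input. `split_working_models` and `build_fn` are hypothetical names, not part of METEOR; in practice `build_fn` would wrap the call to `make_meteor_training_data`, whose signature is not shown in this thread.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def split_working_models(
    models: Iterable[str],
    build_fn: Callable[[str], object],
) -> Tuple[List[str], Dict[str, str]]:
    """Run build_fn on every model and separate successes from failures.

    Returns (working, failed), where failed maps each model name to a
    short description of the exception it raised.
    Hypothetical helper for illustration only.
    """
    working: List[str] = []
    failed: Dict[str, str] = {}
    for model in models:
        try:
            build_fn(model)
        except Exception as err:  # broad on purpose: any failure excludes the model
            failed[model] = f"{type(err).__name__}: {err}"
        else:
            working.append(model)
    return working, failed
```

Logging `failed` alongside the results would also record which models were dropped on a given day, which helps when comparing runs.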

maritsandstad commented 1 day ago

It's been a minute, but at least at some point the datagetter could give you a list of available models. That is where I got my model list from earlier (rather than making a manual one). You could always cross-check that list against a list of your own to get the models you want that also have data. I can double-check next week that this is still available, though. @benmsanderson made some changes to make it more flexible, and maybe that also got rid of this functionality, though I shouldn't think so...
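The cross-check itself needs nothing from the data getter beyond that list. A minimal sketch, assuming both lists are plain sequences of model-name strings; `cross_check_models` is a hypothetical helper, and how the available list is obtained from the data getter is left open here.

```python
from typing import Iterable, List, Tuple


def cross_check_models(
    wanted: Iterable[str],
    available: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split the wanted models into those with data and those without.

    The order of `wanted` is preserved in both returned lists.
    Hypothetical helper for illustration only.
    """
    wanted_list = list(wanted)
    available_set = set(available)
    usable = [m for m in wanted_list if m in available_set]
    missing = [m for m in wanted_list if m not in available_set]
    return usable, missing
```

Looping over `usable` rather than a hand-maintained list keeps the script running when availability changes, while `missing` shows what was skipped.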