Plausible data sets/analyses for motivating example

rolyp commented 10 months ago

Plausible (ideally real-world) data + analysis for the motivating example (originally #765).

See the ICCS paper on Bayesian Model Averaging, which mentions Dynamic Time Warping in the context of weather (not climate) simulation. Two ideas for possible data sets:

Weather data used in DTW article linked to from above paper
Climate simulation data used in above paper

min-nguyen commented 9 months ago

I can't find what they're referring to as weather data in the linked paper.

I can however see mention of climate data (generated by GCMs), in Section 3 (Data):

Daily mean surface temperature simulations from the historical experiment of five GCMs from the latest phase of the Coupled Model Intercomparison Project (CMIP6) were used to demonstrate the method presented here: GFDL-ESM4, IPSL-CM6A-LR, MPI-ESM1-2-HR, MRI-ESM2-0 and UKESM1-0-LL.

The datasets are quite large (perhaps 50 MB), so here are some HTTP download links to specific ones. They come in the form of NetCDF files:

Some datasets don't have links, so here are instructions for how to get them (you can find similar datasets by applying the same steps). For the output data from the model UKESM1-0-LL:

1) Go to https://esgf-node.llnl.gov/search/cmip6/ and search for "UKESM1-0-LL".
2) Download one of the WGET scripts from an appropriate result (one that lists more than 0 files).
3) Execute the script with the argument "-H" and, when prompted, provide the username (open-id) "fluid" and password "Fluid2023!".

This should download a NetCDF file, e.g. "cropFrac_Lmon_etc.nc".

rolyp commented 9 months ago

Great, thanks. Yes, I figured that the original data sets might be quite huge. Here’s the repo associated with that paper. According to the README, the data used should be in a folder called data (but there is no such folder). So I’m wondering if it’s worth emailing her to ask her to populate that folder for us. (Seems reasonable given the claim in the README.)

Alternatively, there’s a Python notebook in the repo which claims to download the data set. So you could try running that and seeing if that’s a more useful starting point.

rolyp commented 9 months ago

@min-nguyen Then it would make sense to pick up this task. From the comments in the Python repo, it looks like they download the data set, filter to 2 years of data for Nairobi, and then save it as a local file. So the first task is probably to get this script running locally. Then we can figure out how best to proceed, i.e. whether it’s worth storing a copy of that local data file in our repo or whether we can just convert directly to Fluid. (Either way we can point to their repo as the source of the Python script we used to get the data, assuming it works!)

min-nguyen commented 9 months ago

Ignore below. Some notes for me:

Running ipynb on anaconda3 environment (Python 3.11.5)

pip3 install esgf-pyclient
pip3 install geopy
pip3 install xclim
pip3 install netcdf4

Download one of the nc files from the W5E5 reanalysis reference data set. The smallest one is this one.

Move the nc file to the current directory CI_2023/

W5E5 = xr.open_mfdataset(os.path.join(wd, '*.nc'), engine = 'netcdf4').sel(lat=latitude, lon=longitude, method='nearest').convert_calendar("noleap")
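For my own reference, here is roughly what I understand the nearest-neighbour selection in that line to be doing, as a plain-Python sketch. The grid axes below are hypothetical (not the real W5E5 coordinates), the Nairobi coordinates are approximate, and `nearest_index` is a made-up helper, not part of xarray. It also ignores longitude wrap-around.

```python
def nearest_index(coords, target):
    """Return the index of the coordinate value closest to target."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))


# Hypothetical grid axes for illustration only, not the real W5E5 grid.
lats = [-2.25, -1.75, -1.25, -0.75]
lons = [36.25, 36.75, 37.25]

# Approximate coordinates for Nairobi (assumed for illustration).
latitude, longitude = -1.29, 36.82

i = nearest_index(lats, latitude)
j = nearest_index(lons, longitude)

# The grid cell that would stand in for the requested location.
print(lats[i], lons[j])
```

So the series we end up with belongs to the closest grid cell rather than to the exact requested point.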

min-nguyen commented 9 months ago

I've managed to get a working version of the script data_download.py or data-download.ipynb: https://github.com/explorable-viz/CI_2023/tree/main. I've also had to install some Python dependencies manually, rather than use conda as suggested by Mala -- I think they forgot to include the conda_environment.yml file in their repo.

The extracted .nc files from the first 2 years are actually quite small, so I've also pushed them to the repo, found under CI_2023/data/models/ and CI_2023/data/reference (I'm not sure what the meaningful differences are between the files in each directory). A nice way to visualise .nc files in VSCode is with the H5Web extension.

I'm not familiar with the formatting of these files, nor what specific data is important to us in the datasets, so perhaps we should talk about how they would be presented in fluid code.

rolyp commented 9 months ago

Awesome. Maybe we should add a step that exports the data into some human-readable format like CSV and then work out how we might want to filter/select within this further (if necessary) before converting to Fluid. Or maybe it makes more sense to look at how they use the .nc files in their algorithm. I assume at some point they load them into Pandas dataframes or similar, so maybe that would be a more useful intermediate format for us.
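To make the CSV idea concrete, here is a minimal sketch of the export step using only the standard library. It assumes we have already pulled a single time series out of the .nc file as two parallel lists; the function name `series_to_csv`, the column name "tas" and the sample values are all placeholders of mine, not taken from the real data. In practice the script would probably go through xarray and a dataframe instead.

```python
import csv
import io


def series_to_csv(times, values, value_name="tas"):
    """Write parallel lists of timestamps and values as two-column CSV text."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["time", value_name])  # header row
    for t, v in zip(times, values):
        writer.writerow([t, v])
    return buf.getvalue()


# Placeholder values only, not taken from the real data set.
csv_text = series_to_csv(["2000-01-01", "2000-01-02"], [290.0, 291.5])
print(csv_text)
```

A flat two-column file like this would be easy to inspect by eye and to filter further before converting to Fluid.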

JosephBond commented 8 months ago

One useful thing to do here would be to collect two sequences of equal length. That would let us reduce the window parameter to a small value, such as a slack of 1 in either direction, which could reduce how much of either input we pick up in backward slicing. Would it be worth me extending our current sequences of real numbers with a few extra values, so I can get a better idea of what the calculated selections look like?
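To illustrate the point about the window, here is a small Python sketch of DTW restricted to a band around the diagonal. It is not our Fluid implementation; the squared-difference local cost and the name `dtw_distance` are assumptions of mine. Note that the band can never be narrower than the difference in lengths, which is why equal-length inputs allow a slack of 1.

```python
import math


def dtw_distance(xs, ys, window):
    """DTW cost between xs and ys, only matching indices with |i - j| <= window."""
    n, m = len(xs), len(ys)
    # The band must be at least as wide as the length difference,
    # otherwise no warping path can reach the final cell.
    w = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are ever filled in.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (xs[i - 1] - ys[j - 1]) ** 2  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Equal-length sequences, so a slack of 1 in either direction is enough.
print(dtw_distance([0, 0, 1, 0], [0, 1, 0, 0], window=1))
```

With a narrow band, each output cell depends on only a few neighbouring input elements, which is what should keep the backward slices small.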

JosephBond commented 8 months ago

Just going to start summing up my current thoughts on the motivating example. We want to be able to show something with as interesting and non-trivial a linking as we can, but this is of course mediated by the complexity of the example we will actually be able to implement.

Any example we choose will, in my opinion, need an additional property to be compelling. Fix input views V1, V2 and output views V3, V4. We do not want one linking to subsume the other, i.e. (fudging notation in a way I hope makes sense here): the selection found by following V1 -> V3 -> (V1, V2) should not be a proper subset of the selection found by following V1 -> V4 -> (V1, V2). I think it is fine if this condition does not hold going the other way, from V2 -> V3/V4 -> (V2, V1); we only need it in one direction to make a compelling example, and I am not sure we have the space to run both linkings in both directions in depth. In at least one direction, but not necessarily both, we also want some self-linking to occur. A question raised recently is whether we want to cover both self-linking and round-tripping in the example, since these are related. This could be used to segue into the discussion on De Morgan duals: "one may ask if we can round trip to find the other data that our necessary data is also necessary for".
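Stated with sets, the non-subsumption condition is easy to check once we have the two selections. A minimal Python sketch, where the function name `not_subsumed` and the cell labels are hypothetical and only stand in for whatever the round trip via V3 or V4 actually selects in (V1, V2):

```python
def not_subsumed(selection_via_v3, selection_via_v4):
    """True if the V3 selection is not a proper subset of the V4 selection."""
    # For Python sets, `a < b` tests for a proper subset.
    return not (set(selection_via_v3) < set(selection_via_v4))


# Hypothetical selections over input cells, for illustration only.
via_v3 = {("V1", 0), ("V2", 3)}
via_v4 = {("V1", 0), ("V1", 1)}

print(not_subsumed(via_v3, via_v4))
```

Equal selections pass this check, since the condition only rules out proper subsets; we may want to treat that case separately.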

For the bubble chart, one possible option is to use the individual cities from the cities dataset as bubbles in the chart. We could link to countries by comparing, say, a city's GDP or estimated water consumption with that of the country overall. The bubble chart could represent a snapshot of the data per country, if those were arranged by year as well. The downside to using cities in the bubble chart is that it is hard to think of how to aggregate them alongside countries in the line chart part of the example.

Another point of difficulty is how similar to or distinct from the previous example we wish to be. Is using an example similar to the previous one a positive, since the reader is already in that mindset, or is it a hindrance, confusing the reader as to which of the examples is being referred to elsewhere in the section?

rolyp commented 7 months ago

Decision for now – plot non-renewables against renewables, with bubble size representing something about GDP.

JosephBond commented 7 months ago

I have made a pretty decent first approximation of the analysis we described, but my current implementation of the example shows an interesting artifact of the dependency analysis. Since it relies so much on record projection, there are currently no dependencies computed in one direction (non-renewables to renewables). I will experiment and see if the other direction works as is. If not, I will need to rethink the way I unify the two tables in the program.

JosephBond commented 7 months ago

I was able to cobble together some reasonably valid data about energy capacity for China, the USA and Germany, but it's been difficult, and there aren't unified sources of information. I need to go back through and catalogue where I got each of the different numbers from so that we can reference them, or at least point the reader in the correct direction.

JosephBond commented 7 months ago

Okay, I think this source of data is reputable and contains pretty much all the information we could potentially need. It's a pretty hefty CSV, but has years of global data, including energy capacity.

I'm going to pick a year (I propose 2015, to work with the data we already use in the bar chart), and lift the information from here into our fluid file.
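Pulling out one year from the CSV should be straightforward. A rough Python sketch, assuming the file has a column identifying the year; the column names, the function name `rows_for_year` and the sample rows are placeholders of mine, not the real dataset's schema or values.

```python
import csv
import io


def rows_for_year(csv_text, year, year_column="year"):
    """Return the rows (as dicts) whose year column matches the given year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[year_column] == str(year)]


# Synthetic rows with assumed column names, for illustration only.
sample = "country,year,capacity\nA,2014,1.0\nA,2015,2.0\nB,2015,3.0\n"

selected = rows_for_year(sample, 2015)
print([row["country"] for row in selected])
```

The selected rows would then be transcribed into the Fluid file by hand or by a small conversion step.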

JosephBond commented 7 months ago

Closing as completed, using the previously mentioned dataset. May reopen if/when we revise the linked-outputs example according to explorable-viz/graphical-slicing#315

explorable-viz / fluid

Plausible data sets/analyses for motivating example #770