Suggested augmentation of CMIP6 Summary Table (of archive holdings)

taylor13 commented 5 years ago

Here are some suggestions for providing additional information in summary form to the table at: https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ . @durack1 and other scientists should be asked to help prioritize the list and/or suggest additional information of high interest.

1. It would be helpful if we used 3 shades of green (from light to dark) in the table to indicate how recently data was added:
   - light green: data has been available for more than a month
   - medium green: data has been available for more than a week, but less than a month
   - dark green: data has been available for less than a week.
2. Just under the column labels (in a separate row?), give the total number of models with green boxes in that column. [Similar information could be obtained with 5 clicks in the CoG search page.]
3. Just to the right of the source_id's (in a separate column?), give the total number of activities with green boxes in that row. [Similar information could be obtained with 5 clicks in the CoG search page.]
4. Construct a similar table but, rather than indicating the number of datasets, indicate the number of experiments performed and the number of simulations (e.g., "5 / 8" could indicate that 5 experiments were performed and output is available from 8 simulations).
5. Construct a similar table but, rather than indicating the number of datasets, indicate the number of variables available for download.
6. For each "activity" listed in the column headings of the upper table, link to a table with models defining rows and experiments (performed as part of the activity) defining columns. For example, clicking on the column heading "CMIP" would get you to the table appearing at the bottom of the current page (https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/).
7. Construct a table with columns=activity and rows=sampling frequency; each cell would indicate how many models have contributed output at each frequency. Then provide links to tables focusing on individual activities (by clicking on this new table's column headings) where one would find: columns=experiments (for a single activity) and rows=sampling frequency; each cell would indicate the number of models that have provided output at each frequency for each experiment.
8. Once a day (or so), save the statistics from the tables into a "data base" that could be used subsequently to track growth of the archive. [This information could also be harvested directly from the ESGF catalogs.]
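The three-shade rule in the first suggestion could be sketched as follows. This is only an illustration: the function name is hypothetical, "a month" is assumed to mean 30 days, and a dataset exactly one week old is assumed to fall in the medium band, since the suggestion does not say which side the boundaries belong to.

```python
from datetime import datetime, timedelta


def shade_for_age(published: datetime, now: datetime) -> str:
    """Return the shade for a dataset based on how long it has been available."""
    age = now - published
    if age < timedelta(days=7):
        # Newest data gets the darkest shade.
        return "dark green"
    if age < timedelta(days=30):
        # Assumption: "a month" is treated as 30 days.
        return "medium green"
    return "light green"
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the rule easy to test and lets a whole table be shaded against one consistent timestamp.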

durack1 commented 5 years ago

@taylor13 thanks for opening this, and the detailed suggestions above.

I still think that having a table that includes the entire "wish list" of contributions, extracted from the current CMIP6_CVs/CMIP6_source_id.json file, is a good idea. The reason I think this is useful: if an analyst plans to generate a CMIP6 multi-model analysis, it would help to know what the complete archive is expected to comprise. This will be particularly helpful for AR6 contributors, as it will allow an analyst to make an informed decision about what proportion of the expected multi-model archive (for the historical experiment, for example) their results are currently drawn from.

In my CMIP5 experience, any insights from early CMIP5 analysis were considerably compromised, and results did not settle until around 20 models (just under half the archive) became available. I queried this with @pochedls yesterday and he seemed in agreement.

@mauzey1 @sashakames happy to hear your views too!!

taylor13 commented 5 years ago

I can see value in the table you suggest, but I think that table should be reached via a second click from text just above or below the current table: "Expanded Table View Indicating Output Expected in the Future". It could look like the first table, but with (perhaps light yellow) boxes added to indicate models that intend to provide output.

pochedls commented 5 years ago

For my purposes, the most useful table would look something like this:

| Experiment | Models with data in ESGF / Expected number of models |
| --- | --- |
| amip | 3 / 107 |
| piControl | 2 / 56 |
| historical | 5 / 94 |
| abrupt4xCO2 | 7 / 107 |
| ... | ... |

Most people want a quorum of models and fairly standard output variables; the table above tells you if you should start your analysis. Something similar with number of simulations could be useful, too, but I think the table above is probably a pre-requisite to thinking about how many ensemble members models have for a given experiment.
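A table of this shape could be produced roughly as below. This is a sketch under stated assumptions: the function name and both input mappings (experiment to set of source_ids) are hypothetical, and where the "expected" sets would come from is left open.

```python
def quorum_table(available, expected):
    """Build (experiment, "have / expected") rows.

    available, expected: dict mapping experiment_id -> set of source_ids.
    Both inputs are hypothetical; this only shows the bookkeeping.
    """
    rows = []
    for experiment in sorted(expected):
        have = len(available.get(experiment, set()))
        want = len(expected[experiment])
        rows.append((experiment, f"{have} / {want}"))
    return rows
```

Experiments with no data yet still appear, with a leading 0, so an analyst can see at a glance which experiments are far from a quorum.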

taylor13 commented 5 years ago

I had suggested that under the experiment_id at the top of the two tables we would include the total number of models with output currently available (i.e., the sum of the green boxes in the column). We could put @pochedls's numbers in the same place in the format he suggests. Then we wouldn't need any new tables. In any case, I agree that the total numbers are more important than indicating which models are expected to contribute, but showing the full matrix would obviously provide additional information.

mauzey1 commented 5 years ago

I have started making changes to the tables starting with the shading based on dataset submission time.

[Screenshot, 2019-01-18: ESGF CMIP6 data holdings]

durack1 commented 5 years ago

@mauzey1 nice! This looks great..

It might be a good idea to also include a link to the experiment_id entries, along with the source_id entries, at the bottom of the page, providing more information for folks browsing these pages who want to know more.

A tweak I'd recommend "Within 7 days" -> "Less than 7 days"

Happy to provide commentary on continuing tweaks!!

taylor13 commented 5 years ago

Cool! Exactly what is needed. Thanks!

Or instead of "Within 7 days", "Younger than 7 days"? Also, should it be "...how recently..." rather than "...how recent..."?

sashakames commented 5 years ago

To comment on (8) from Karl's list: while we aren't currently collecting the history, another way to go about it is, once the "Wayback Machine" posts daily static pages from 01/2019, to go back and capture the history of the page: https://web.archive.org/web/20190815000000*/https://pcmdi.llnl.gov (empty now, but perhaps sometime next month).

durack1 commented 5 years ago

@sashakames I think actually capturing the numbers is far better than depending on the wayback machine. If we have the numbers in a text file (or similar) it will be trivial to load and query this rather than have to write a clunky script that polls an external web archive.
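As a rough illustration of the "numbers in a text file" idea, the sketch below appends one dated row per activity to a CSV file and reads the history back. The file layout, column names, and function names are all assumptions made for the example, not an existing PCMDI format.

```python
import csv
from datetime import date
from pathlib import Path


def append_snapshot(path, day, counts):
    """Append one row per activity for the given day to a CSV file.

    counts: dict mapping activity_id -> number of datasets (hypothetical input).
    """
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            # Write the header only when the file is first created.
            writer.writerow(["date", "activity_id", "n_datasets"])
        for activity, n in sorted(counts.items()):
            writer.writerow([day.isoformat(), activity, n])


def load_history(path, activity):
    """Return [(date_string, count), ...] for one activity, in file order."""
    with Path(path).open(newline="") as handle:
        return [
            (row["date"], int(row["n_datasets"]))
            for row in csv.DictReader(handle)
            if row["activity_id"] == activity
        ]
```

Calling `append_snapshot` once a day (for example from a cron job) would be enough to track growth; the same rows could later be bulk-loaded into a database table if one is set up.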

@pochedls has a MySQL(?) database configured for the xmls; maybe we could create a new database with an appropriate table for this info.

pochedls commented 5 years ago

@durack1 - I am capturing some of these stats when I do scans (see screenshot of some of the db stats). It was part of this issue.

Perhaps a conversation for a different place, but it might be helpful to have read access to the ESGF database (or read access to a daily database dump).

durack1 commented 5 years ago

@pochedls this is excellent. We already have a database of the locally replicated data; to augment this, what I am proposing here is a database of the global ESGF data.

sashakames commented 5 years ago

The point of using the Internet Archive would be to backfill missing days from our record. Also, before creating an additional SQL database, we should see if we can pose the desired queries to the Solr index. If that is too limited, then OK, consider additional tables in MySQL.
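One way to pose such a query is a facet request that returns counts only. The sketch below builds a request URL and decodes Solr's facet list; the endpoint URL and the parameter names (`project`, `type`, `facets`, `limit`, `format`) are assumptions based on the ESGF search API and should be checked against the index node actually used.

```python
from urllib.parse import urlencode

# Assumed endpoint; substitute whichever ESGF index node you query.
SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"


def facet_query_url(facet, **constraints):
    """Build a search URL asking for facet counts only (no result records)."""
    params = {
        "project": "CMIP6",
        "type": "Dataset",
        "facets": facet,
        "limit": 0,  # we only want the counts, not the matching records
        "format": "application/solr+json",
    }
    params.update(constraints)
    return SEARCH_URL + "?" + urlencode(params)


def facet_counts(response, facet):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][facet]
    return dict(zip(flat[0::2], flat[1::2]))
```

For example, `facet_query_url("source_id", activity_id="CMIP")` would ask how many datasets each model has published under one activity; the decoded JSON response is then passed to `facet_counts`.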

mauzey1 commented 5 years ago

I have added total values for both the rows and columns of the tables.

[Screenshot, 2019-1-24: ESGF CMIP6 data holdings]

@taylor13 @durack1 How would I find the number of experiments performed and the number of simulations available?

durack1 commented 5 years ago

@mauzey1 this information is contained in the file https://github.com/WCRP-CMIP/CMIP6_CVs/blob/master/CMIP6_source_id.json. For each model, the activity_participation field lists every activity_id the model intends to contribute to.

I have some code that reads this file off the github repo directly - see here for usage, and the function is contained here

I note that we have only registered information about which activity_id a model intends to contribute, there is no registration of the experiment_id within the activity_id at this time
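Counting registered participants per activity from that file might look like the sketch below. It is not the function linked above; it assumes the parsed JSON has a top-level `source_id` mapping whose entries each carry an `activity_participation` list, and the function name is hypothetical.

```python
from collections import Counter


def models_per_activity(source_id_cv):
    """Count how many registered models list each activity_id.

    source_id_cv: the parsed contents of CMIP6_source_id.json
    (assumed layout: {"source_id": {model: {"activity_participation": [...]}}}).
    """
    counts = Counter()
    for entry in source_id_cv["source_id"].values():
        for activity in entry.get("activity_participation", []):
            counts[activity] += 1
    return dict(counts)
```

The dictionary would typically come from `json.load` on a freshly downloaded copy of the file, so that the counts follow the registrations as they change.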

taylor13 commented 5 years ago

In the table above, I think it would be more useful to simply add up the number of green cells in each column and in each row. So, for example, 7 is the number of models that have performed the "historical" experiment and BCC-CSM2-MR has participated in 1 activity (CMIP). The row heading would read "# of models" and the column heading would read "# of activities" in the upper figure and "# of expts." in the lower figure.
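The totals described here amount to counting non-empty cells rather than summing dataset counts. A minimal sketch, assuming the table is held as a nested dict (a hypothetical layout) and that a cell is "green" when its dataset count is above zero:

```python
def green_cell_totals(holdings):
    """Count non-empty cells per row and per column.

    holdings: dict mapping source_id -> {column_name: number_of_datasets}.
    Returns (per_row, per_column) dictionaries of green-cell counts.
    """
    per_row = {}
    per_column = {}
    for source_id, cells in holdings.items():
        per_row[source_id] = sum(1 for n in cells.values() if n > 0)
        for column, n in cells.items():
            # Make sure empty columns still show up with a total of 0.
            per_column.setdefault(column, 0)
            if n > 0:
                per_column[column] += 1
    return per_row, per_column
```

With activities as columns, `per_row` gives the "# of activities" for each model and `per_column` the "# of models" for each activity; with experiments as columns, the row totals become "# of expts."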

taylor13 commented 5 years ago

I think I would leave the first row and column uncolored. This would help users realize that these cells provide different information from the others.

mauzey1 commented 5 years ago

@taylor13 For (4) on your list, would the number of simulations be the number of variant labels for a source and activity?

taylor13 commented 5 years ago

No ... it's a little more complicated. For each experiment in an activity you would count the number of variant labels (that gives you the number of simulations on a per model per experiment basis), and then you would sum over all of that activity's experiments to get the total number of simulations performed under that activity. If that's hard to do, maybe we should just do (4) on the 2nd table.
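The counting rule just described can be sketched as follows. The input format (one tuple per dataset) and the function name are assumptions; the point is that variant labels are de-duplicated per model and experiment before being summed over an activity's experiments.

```python
from collections import defaultdict


def simulations_per_activity(datasets):
    """Return {(source_id, activity_id): "experiments / simulations"}.

    datasets: iterable of (source_id, activity_id, experiment_id, variant_label)
    tuples, one per dataset (hypothetical input format).
    """
    # Distinct variant labels per model, activity, and experiment.
    variants = defaultdict(set)
    for source_id, activity_id, experiment_id, variant_label in datasets:
        variants[(source_id, activity_id, experiment_id)].add(variant_label)

    experiments = defaultdict(int)
    simulations = defaultdict(int)
    for (source_id, activity_id, _), labels in variants.items():
        experiments[(source_id, activity_id)] += 1
        # Sum the per-experiment simulation counts over the activity.
        simulations[(source_id, activity_id)] += len(labels)

    return {key: f"{experiments[key]} / {simulations[key]}" for key in experiments}
```

Using a set per experiment matters because the same variant label appears on many datasets (one per variable and table), and each simulation should be counted once.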

durack1 commented 5 years ago

@mauzey1 the reason I suggested above (https://github.com/PCMDI/pcmdi.github.io/issues/246#issuecomment-457329745) that you read the registered information directly from the repo is that we're constantly receiving change requests and updates, so if it's not automated it'll become a full-time job keeping it up to date.

taylor13 commented 5 years ago

I think Paul is asking for something different from what I requested in (4): he wants to know how many models aspire to participate in a certain activity. That info. is not available in the ESGF catalog, but it is available in the CV. I think that's a lower priority "to-do" than some of the others, but maybe Paul has a different view.

durack1 commented 5 years ago

@taylor13 you're right.

I believe that having an idea of what we're expecting will be a very useful metric to capture, as it will allow an analyst (and particularly AR6 authors) to know the likely status of the CMIP6 multi-model mean at any point in time. If fewer than, say, 30% of the expected models are available, then any analysis would not likely warrant much quantitative consideration.

mauzey1 commented 5 years ago

Here are the revised tables with another table for the # of experiments/# of simulations. The tables have gotten wider due to more activities showing up. A scroll bar appears at the bottom for seeing the rest.

[Screenshot, 2019-1-29: ESGF CMIP6 data holdings]

taylor13 commented 5 years ago

Outstanding! Can we install this for live updates for everyone to see? Does the scroll bar control the upper figure too?

Also, if item 6 is implemented (as described in the first posting of this issue), I think we should remove the middle graph on the current page; it would be seen by clicking on the "CMIP" column heading of the first table. It would be good to link the recent table (bottom one) to each activity in the same way and generate activity-specific versions of the bottom table.

mauzey1 commented 5 years ago

@taylor13 The scroll bar controls all of the tables and text. I can set it so that only the wide tables get a scroll bar.

mauzey1 commented 5 years ago

I have included tables for the number of variables for each activity, and the number of models for each sampling frequency and activity. I will start working on making tables for each activity.

[Screenshot, 2019-3-6: ESGF CMIP6 data holdings]
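For reference, the frequency-by-activity table boils down to counting distinct models per cell. A minimal sketch, assuming one (source_id, activity_id, frequency) tuple per dataset as input (a hypothetical format, not the page's actual code):

```python
from collections import defaultdict


def models_by_frequency(datasets):
    """Return {(frequency, activity_id): number of distinct models}.

    datasets: iterable of (source_id, activity_id, frequency) tuples.
    """
    models = defaultdict(set)
    for source_id, activity_id, frequency in datasets:
        # A model counts once per cell, however many datasets it provides.
        models[(frequency, activity_id)].add(source_id)
    return {key: len(found) for key, found in models.items()}
```

Swapping `activity_id` for `experiment_id` in the input gives the per-activity variant of the table, with experiments as columns.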

taylor13 commented 5 years ago

Very nice!
I might tweak the description just above the last table above to read: "Number of models providing output at each sampling frequency in support of each CMIP6 activity."

durack1 commented 5 years ago

@mauzey1 nice from me too!

taylor13 commented 5 years ago

@mauzey1 I received lots of positive comments in Barcelona concerning the usefulness of these tables, so thanks again. There was also a request to create tables for each activity (like the CMIP table listing all the CMIP activity experiments).

I think it would be good to make this a priority. The original idea was:

6.  For each "activity" listed in the column headings of the upper table, link to a table 
with models defining rows and experiments (performed as part of the activity) defining 
columns. For example, clicking on the column heading "CMIP" would get you to the table 
appearing at the bottom of the current page 
(https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/).

but perhaps there is an easier approach.

mauzey1 commented 5 years ago

@taylor13 @durack1 @sashakames

I have updated the data holdings to include pages for each activity.

The main CMIP6 page with links for each activity:

[Screenshot, 2019-4-16: ESGF CMIP6 Data Holdings (full)]

Data holdings page for CMIP:

[Screenshot, 2019-4-16: ESGF CMIP6 CMIP Data Holdings (full)]

Print-friendly version of CMIP data holdings:

[Screenshot, 2019-4-16: ESGF CMIP6 CMIP Data Holdings, Print View (full)]

mauzey1 commented 5 years ago

@taylor13 @durack1 Were there any other metrics or features that we want on the holdings page? Should we close this issue?

PCMDI / pcmdi.github.io

Suggested augmentation of CMIP6 Summary Table (of archive holdings) #246