cf-convention / cf-convention.github.io

sources for website cf-conventions.org

Visualisations for standard names data (includes POC)? #110

Closed sadielbartholomew closed 1 week ago

sadielbartholomew commented 4 years ago

There are various aspects of the standard names that I find interesting & I suspect would be informative to others in a short summary form, notably:

Such data lends itself well to plots or visualisations, & given the systematic XML encoding of the names in the per-version hierarchical directory structure under Data/cf-standard-names, it is possible to write a script that grabs the relevant data & generates them, & that can be re-run at any time to pick up on updates, including new versions.
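As a minimal sketch of how such a script could grab the data (this is not the script on the branch), the following reads each per-version table with the standard library. The element names (`version_number`, `last_modified`, `entry` with an `id` attribute) and the `*/src/cf-standard-name-table.xml` layout are assumptions about the repository and should be checked against the actual files; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarise_table(root):
    """Return (version, last_modified, set of names) for one parsed table.

    Assumes the root element has <version_number> and <last_modified>
    children and one <entry id="..."> element per standard name.
    """
    version = root.findtext("version_number")
    last_modified = root.findtext("last_modified")
    names = {entry.get("id") for entry in root.iter("entry")}
    return version, last_modified, names


def read_all_tables(base_dir="Data/cf-standard-names"):
    """Read every per-version table under base_dir (directory layout assumed)."""
    tables = {}
    pattern = "*/src/cf-standard-name-table.xml"  # assumed file location
    for xml_path in sorted(Path(base_dir).glob(pattern)):
        version, last_modified, names = summarise_table(ET.parse(xml_path).getroot())
        tables[version] = (last_modified, names)
    return tables
```

Because the directory is scanned on every run, re-running after a new version is added would pick it up without any edits.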

I think a few such visualisations could be useful to have on the site, either directly within the http://cfconventions.org/standard-names.html page, or on a page linked off it. They would mean that anyone can see at a glance how the name table has grown & what has been added, whereas at the moment I think someone would have to trawl through the version directories, or do some analysis via coding or using some tool, to work this out?

I raise this in particular because back in February I wrote a script to plot such aspects for interest, & with the CF Workshop coming up I revisited it. I now have what is outlined above (contained in my personal branch here): a Python script to:

It is designed so it can be re-run at any time to re-generate updated visualisations based on the current state of the repo, without any editing (though some minor re-formatting may be required over time to e.g. tweak the axes bounds on a totals plot to optimise the display).

So I already have a means to build, & re-build as necessary, some visualisations (see the examples below). If it is agreed that it would be good to put up some visualisations on the site, I am happy to adapt the script as you all see fit & put up a PR to incorporate it into this repo so it can be used for the site to generate those, or something similar, to display on a page.

What do you think: would you like to see something like this on the site, & if so what do you think about the visualisations I generated with my script (perhaps as a starting point, I am happy to amend them to fit as you see best with the site)? @japamment I believe you are in charge of the Standard Names, so it would be good to hear from you especially.

Proof of concept visualisations

Generated directly using the script in my branch here (on a state of the repo from a few months back, but I will update it shortly to the current state).

Plot of names per date by version:

From passing the extracted data to matplotlib. (With thanks to @davidhassell who suggested to plot by date rather than by version, & to instead have version shown by indicative markers, after I wrote the initial script).

totals-and-diffs-plot

Word clouds of names present in version A but not version B

A nice way to show all the new names by version. My script has some utility functions to determine the new names as strings & passes them to the word_cloud library. Here are examples, using the versions giving the spikes in the differences in total relative to the previous version in the plot above (which seemed most interesting).

(I think word_cloud is parsing & grouping them by items of one or two words by default, judging by the outputs, but that can be tweaked I am sure if longer phrases or lone words would be more useful to show.)
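For illustration, determining the names present in version A but not in version B amounts to a set difference. The helper below is a hypothetical sketch, not the utility function from the branch:

```python
def new_names(names_a, names_b):
    """Return the names present in version A but absent from version B, sorted."""
    return sorted(set(names_a) - set(names_b))
```

For example, calling `new_names` with the v.12 names as A and the v.11 names as B would give the additions shown in the first word cloud.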

New additions for v.12:

wordcloud_diff_12_and_11

And similarly, new additions for v.49:

wordcloud_diff_49_and_48

davidhassell commented 4 years ago

Hi @sadielbartholomew, I think that these visualisations look great and will be informative to those working on CF, and also serve as a good advertisement of CF activities. Thanks! It would be really nice to see these updated for each new release of the standard name table.

cofinoa commented 4 years ago

@sadielbartholomew this is interesting. Is the "word cloud" built from the standard names themselves, or are the descriptions included as well?

cofinoa commented 4 years ago

With respect to the plot of names: since it's a frequency/counting plot, I would generate a histogram, instead of an X-Y line, of the total number of standard names per version release. Also, the stem plot you are using for the differences would be a good option for the total number too.

This other type of plot: https://matplotlib.org/gallery/lines_bars_and_markers/timeline.html#sphx-glr-gallery-lines-bars-and-markers-timeline-py could be interesting; it would be good to see how it looks.

sadielbartholomew commented 4 years ago

Hi @cofinoa, thanks for your comments & sorry for the delay in replying, I have been doing this as a side (mini-)project & didn't get much time in the busy past few weeks to come back to it, but didn't intend to leave it this long before responding. I can dedicate some time to think about this around the CF Workshop.


Is the "word cloud" built from the standard names themselves, or are the descriptions included as well?

Yes, the word cloud is formed solely from a list of the standard names themselves. Descriptions are not involved at all, though they could be if you think that could form an interesting or useful further visualisation?

In terms of the processing, I input the set of raw standard names new to a given version, with underscores replaced by spaces, in a new-line delimited list to the word_cloud library. E.g. for version 12 the head of the input is:

mass concentration of hydroperoxyl radical in air
mass fraction of ethanol in air
mole fraction of hcfc141b in air
atmosphere mass content of cfc113a
atmosphere mass content of cfc11
tendency of mass fraction of stratiform cloud condensed water in air due to icefall
...
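A minimal sketch of that processing step, assuming the `wordcloud` package is installed. The function names and the image settings are illustrative choices, not those used in the script on the branch:

```python
def names_to_text(names):
    """Join names into newline-delimited text with underscores as spaces."""
    return "\n".join(name.replace("_", " ") for name in names)


def make_word_cloud(names, out_path):
    """Render a word cloud image from the given names and save it to out_path."""
    # Imported here so that names_to_text can be used without the package.
    from wordcloud import WordCloud  # third-party: pip install wordcloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate(names_to_text(names))
    cloud.to_file(out_path)
    return out_path
```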

The library uses some algorithm (that is sadly not well-documented, other than that it applies scikit-learn's CountVectorizer, so I'd have to look over the code itself for further detail) to weight non-stop tokens (phrases) occurring in the set of text.

It seems like there is overlap on phrases it picks out, which may not be desirable, e.g. in the v.12 cloud I can pick out both 'mass concentration' & 'concentration of' & from looking at the corresponding names list it seems those often stem from the phrase 'mass concentration of'. I can see if that can be changed through the API call if that is preferable to you &/or anyone else who comments.
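To illustrate why such overlap arises (this is a plain two-word count for demonstration, not the library's own weighting algorithm, and it does not filter stop words), counting adjacent word pairs shows that every name containing 'mass concentration of' contributes to both phrases:

```python
from collections import Counter


def bigram_counts(names):
    """Count adjacent word pairs across underscore-separated names."""
    counts = Counter()
    for name in names:
        words = name.split("_")
        # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts
```

Each overlapping pair is counted once per name, so the two phrases end up with equal counts whenever they come from the same three-word run.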

Overall, perhaps it would be clearer & more useful to have the word cloud instead show just the full unbroken names, rather than tokens (sub-strings) taken from them, as depicted in the examples I included in my opening comment? In that case, there would be equal weighting & therefore equal size for all names, so the visualisation might not look as striking, but it would be more obvious what is represented.


With respect to the plot of names: since it's a frequency/counting plot, I would generate a histogram, instead of an X-Y line, of the total number of standard names per version release.

Good point, the format of my initial example from my opening comment probably isn't appropriate. It is hard to see, but underneath the dashed X-Y line I do have the same data plotted as a 'step' plot to show the true, namely the discretely & irregularly updated, nature of this frequency data, which in my opinion is most appropriate, even more so than a histogram (though they are not that different fundamentally). What do you think? This is what it looks like when I remove the X-Y line (which I added to try to show the trend but it is probably misleading & improper to have):

totals-and-diffs-plot

(I would need to do some further tweaking e.g. to re-colour the left y-axis red). It would not be difficult to convert what I have to a histogram, so if you think that is a better format than the step plot above, I can happily adapt it.
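For reference, a step plot of this kind can be sketched with matplotlib as below. This is an illustrative outline under assumed inputs (a list of `(date, total)` pairs), not the code that produced the figure above; the labels, figure size and file name are arbitrary.

```python
from datetime import date


def sort_by_date(records):
    """Split (date, total) records into date-ordered parallel lists."""
    ordered = sorted(records)
    return [d for d, _ in ordered], [t for _, t in ordered]


def plot_totals(records, out_path="totals-plot.png"):
    """Draw the totals as a step plot and save the figure to out_path."""
    # Imported here so that sort_by_date can be used without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to file without needing a display
    import matplotlib.pyplot as plt

    dates, totals = sort_by_date(records)
    fig, ax = plt.subplots(figsize=(10, 5))
    # where="post" holds each total constant until the next release date,
    # reflecting the discrete, irregular updates.
    ax.step(dates, totals, where="post", color="red")
    ax.set_xlabel("Date of release")
    ax.set_ylabel("Total number of standard names", color="red")
    ax.tick_params(axis="y", labelcolor="red")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```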

Also, the stem plot you are using for the differences would be a good option for the total number too.

True, it would be an appropriate type of plot for the totals, though I am wary that the totals & the difference-per-version figures are really the same data (just read off the jump in the total from one version to the next to get the difference, of course), so showing them both in the same type of plot would depict the same pattern & hence, I think, be excessive. So in my view, ideally we either do one totals plot or we add a difference plot of another plot type...
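To make the relationship concrete, the per-version differences are just the consecutive jumps in the totals. A small sketch (the function name is made up for illustration):

```python
def differences(totals):
    """Return the change in total from each version to the next."""
    return [later - earlier for earlier, later in zip(totals, totals[1:])]
```

Note that this gives the net change per version, which equals the number of new names only if no names were removed in that version.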

This other type of plot could be interesting; it would be good to see how it looks.

Yes indeed, good find. Saying that, whilst I think something of that format could look quite nice, in the case of standard names:

both of which make me think this plot type would end up being a bit too cluttered. But I could always adapt my code to generate one and see; do you think it would be worthwhile to prototype this?

JonathanGregory commented 4 years ago

Dear @sadielbartholomew Thanks for these interesting plots. I agree with @davidhassell that they would be informative and useful to include on the website somewhere. The timeseries plot demonstrates that the standard name table has been for a long time and still is under active development. Anyone looking at the plot would probably ask what happened to cause the two big increments in number of names in 2009 and 2018. Maybe you could hard-code labels for those steps, with information that is probably stored in Alison @japamment's memory? Perhaps the first one was due to the addition of lots of chemical names. Also, it could be useful to include a second line to indicate the number of aliases, because sometimes groups of new aliases have been introduced in order to change some decision about nomenclature consistently, leading to simultaneous and equal increments in the number of non-alias and alias standard names. Best wishes Jonathan

japamment commented 4 years ago

Dear @sadielbartholomew

I agree these plots are interesting and the word clouds are fun! I'd certainly support publishing them on the website.

Responding to @JonathanGregory:

Anyone looking at the plot would probably ask what happened to cause the two big increments in number of names in 2009 and 2018. Maybe you could hard-code labels for those steps, with information that is probably stored in Alison @japamment's memory?

The big jump in 2009 / Version 12 was due to the addition of standard names for CMIP5, which at that time almost doubled the size of the standard name table! (This was back in the bad old days before we had the CEDA vocabulary editor and generating updates was a largely manual and incredibly time consuming process, so they were done a lot less frequently than now). In 2018 it was CMIP6 names - those were added through a series of monthly updates, thanks to the vocabulary editor.

Responding to an earlier comment by @sadielbartholomew

It seems like there is overlap on phrases it picks out, which may not be desirable, e.g. in the v.12 cloud I can pick out both 'mass concentration' & 'concentration of' & from looking at the corresponding names list it seems those often stem from the phrase 'mass concentration of'. I can see if that can be changed through the API call if that is preferable to you &/or anyone else who comments.

I think it would be useful if you could tune the word clouds to pick out certain whole phrases, such as "mass_concentration", "mole_concentration", "due_to_convection", etc., as this is how the names are put together, rather than only taking single words in isolation. I think doing the plots just for the names, rather than the description text, is the right way to go. It's clearer.
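One possible way to do this, sketched under the assumption of a hand-maintained phrase list (the three phrases are the examples given above; the function is hypothetical), is to count how many names contain each whole phrase and pass those counts to the word cloud:

```python
from collections import Counter

# Example phrases taken from the comment above; extend as needed.
PHRASES = ["mass_concentration", "mole_concentration", "due_to_convection"]


def phrase_frequencies(names, phrases=PHRASES):
    """Count the names containing each phrase as a run of whole words."""
    counts = Counter()
    for name in names:
        padded = f"_{name}_"  # padding so matches must align on word boundaries
        for phrase in phrases:
            if f"_{phrase}_" in padded:
                counts[phrase.replace("_", " ")] += 1
    return dict(counts)
```

The resulting dictionary could be given to `WordCloud.generate_from_frequencies` from the `wordcloud` package, so that each phrase is sized by its count rather than being split into separate words.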

Also, I agree with Jonathan's comment that lots of other changes do take place in the standard name table between versions that are not new names, but would come under the heading of "maintenance", e.g. aliases. We do also update the description text from time to time to make it clearer - this can sometimes affect a lot of names but I don't know if it's something we could keep track of using these diagrams, or indeed whether people would be all that interested in seeing plots of that.

sadielbartholomew commented 4 years ago

Thanks for the comments @JonathanGregory & @japamment & please accept my apologies for the delay in replying.

I think it is easiest & will be clearer to respond to comments from you both simultaneously, as they are mostly interrelated. I have tried to separate the topics with line dividers (& I'll try to mention a name so you can skip specific comments if you wish)...


Overall it's great to hear you both find the plots interesting too & would support them going onto the website. I will look at the setup of the website in code terms and prepare a Pull Request to populate those in a sensible place. I'll provide written steps on how to make use of some utility functions I have on that branch so anyone can trivially rebuild the plot & word clouds to update them when there is a new version out.

If either of you (or indeed anyone else) has any thoughts on the best way to incorporate & present the plot or word clouds within the site, please let me know. Otherwise I'll go with what seems most sensible and people can always provide feedback when the PR is up. In fact that option might be easier as I will have something to show as a starting point to reference.

I should mention I have since noticed a few hiccups in the output visualisations I have shown here, notably:

I will check those issues & correct them (if necessary). Hopefully on the PR I put up eventually someone can kindly do a check that the methodology for generating the plots and the outputs are accurate (I do think they are otherwise correct though as my code shows me the overall figures when they are extracted & there is nothing amiss). I'll also update my branch with latest master so I can pick up on the latest state of the name tables (the current state shown is that from around March this year).


The timeseries plot demonstrates that the standard name table has been for a long time and still is under active development.

Yes I agree @JonathanGregory & I think it is very impressive that it has been going strong for ~15 years!


Anyone looking at the plot would probably ask what happened to cause the two big increments in number of names in 2009 and 2018. Maybe you could hard-code labels for those steps, with information that is probably stored in Alison @japamment's memory? Perhaps the first one was due to the addition of lots of chemical names.

The big jump in 2009 / Version 12 was due to the addition of standard names for CMIP5 ... In 2018 it was CMIP6 names - those were added through a series of monthly updates, thanks to the vocabulary editor.

It is interesting you mention that @JonathanGregory, as my curiosity about what was causing the spikes was the motivation for setting up the word clouds. The new names to emerge for the versions causing the spikes should (assuming I haven't got the versioning one out, as pondered above) be those depicted in the two word clouds I provided in my initial comment. The earlier one (by version/time) does seem largely driven by chemistry-related phrases, & the later one consists almost entirely of radioactivity-related phrases, including many isotopes.

Aha! Thanks for providing the background about those spikes @japamment. It sounds almost obvious now you explain it :smile:! I definitely think it is worth hard-coding as Jonathan suggests some CMIP{5, 6} labels against the spikes to convey that simple explanation.
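A rough sketch of how such hard-coded labels might be attached to a matplotlib plot. The label wording, the choice to key on the year, and the function names are all assumptions for illustration; since the CMIP6 names arrived over a series of updates, tagging the first 2018 release is an arbitrary simplification.

```python
from datetime import date

# Hypothetical label text, keyed by the year of the jump.
SPIKE_LABELS = {2009: "CMIP5 names (v12)", 2018: "CMIP6 names"}


def spike_annotations(dates, labels=SPIKE_LABELS):
    """Return (date, label) for the first release in each labelled year."""
    seen = set()
    annotations = []
    for d in sorted(dates):
        if d.year in labels and d.year not in seen:
            seen.add(d.year)
            annotations.append((d, labels[d.year]))
    return annotations


def annotate_spikes(ax, dates, totals, labels=SPIKE_LABELS):
    """Add the labels to an existing matplotlib Axes holding the totals plot."""
    total_by_date = dict(zip(dates, totals))
    for d, text in spike_annotations(dates, labels):
        ax.annotate(
            text,
            xy=(d, total_by_date[d]),
            xytext=(0, 15),
            textcoords="offset points",
            ha="center",
            arrowprops={"arrowstyle": "->"},
        )
```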

Also, it could be useful to include a second line to indicate the number of aliases, because sometimes groups of new aliases have been introduced in order to change some decision about nomenclature consistently, leading to simultaneous and equal increments in the number of non-alias and alias standard names.

Also, I agree with Jonathan's comment that lots of other changes do take place in the standard name table between versions that are not new names, but would come under the heading of "maintenance", e.g. aliases.

Those are very good points regarding aliases & other "maintenance" changes, I hadn't thought to consider aspects like that.

I agree with you both that it would be instructive to depict the number of aliases also, & any other maintenance changes that might be interesting.

I'll investigate & include the alias data at the least in the plot(s) that I put in as a PR. It'll be interesting to see what influence, if any, it has on the totals when aliases are not double-counted (if I have interpreted your comment correctly).
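As a starting point, the alias count per version could be extracted alongside the entry count. The sketch below assumes aliases appear in the XML as `<alias id="...">` elements separate from the `<entry id="...">` elements, which should be verified against the actual tables; the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def count_entries_and_aliases(root):
    """Return (number of entries, number of aliases) for one parsed table."""
    entries = {entry.get("id") for entry in root.iter("entry")}
    aliases = {alias.get("id") for alias in root.iter("alias")}
    return len(entries), len(aliases)
```

Plotting the second number as its own line would show where a jump in the totals coincides with an equal jump in aliases, i.e. a renaming rather than genuinely new names.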


We do also update the description text from time to time to make it clearer - this can sometimes affect a lot of names but I don't know if it's something we could keep track of using these diagrams, or indeed whether people would be all that interested in seeing plots of that.

Thanks @japamment, I'll look into it. It certainly sounds useful to include on the plot(s) to depict changes in such aspects if it is possible without too much trouble.


Finally, all in response to @japamment:

I think it would be useful if you could tune the word clouds to pick out certain whole phrases, such as "mass_concentration", "mole_concentration", "due_to_convection", etc., as this is how the names are put together, rather than only taking single words in isolation.

Indeed that does sound like it could be useful. I'll start to look into it and report back either on this thread or in the opening comment of the PR when I have it ready to put up.

I think doing the plots just for the names, rather than the description text, is the right way to go. It's clearer.

Good to hear, as I would be inclined to agree & that is much simpler to do (especially as I have the code for that already)!

sadielbartholomew commented 1 week ago

After thinking about this again around the CF Workshop 2024, now the question is more about where to add such visualisations, so I have created a new Issue. Superseded by #547, so closing.