Pull standard names into Python

lhmarsden commented 1 year ago

Hi,

I would like to pull the latest version of the standard names - including descriptions, units and the grouping - into Python for a template generator I am building. Is there a way to do this?

Of course I could pull them from here: http://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html and configure something manually.

But ideally I would like to build something more future-proof, accounting for any additions to the standard names list and also any possible reformatting of the above page. I assume this page pulls data from somewhere. Is that 'somewhere' publicly accessible? Perhaps an API?

Can you please help me with this?

Thanks!

Luke

MathewBiddle commented 1 year ago

You could use pandas to read the xml endpoint?

https://cfconventions.org/Data/cf-standard-names/79/src/cf-standard-name-table.xml

For example,

import pandas as pd

# Read every <entry> element of version 79 of the table into a DataFrame
url = 'https://cfconventions.org/Data/cf-standard-names/79/src/cf-standard-name-table.xml'
df = pd.read_xml(url, xpath="entry")

df

lhmarsden commented 1 year ago

Thanks! Just what I need.

Do you know if the information to group them is somewhere too? I couldn't see this in the XML. For example, if I select 'Sea Ice' on this page: http://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html

It would be super if I could group the standard names in the same way you do - accounting also for any new terms or changes in grouping.

zklaus commented 1 year ago

If you take a look at the JavaScript on that web page, you will see that this grouping is simply a text search with carefully selected keywords, which sets the display style of non-matching HTML table rows to invisible. You can see the exact filter applied in the search options of the form after clicking on a group category.
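A minimal sketch of that keyword-based grouping in Python might look like the following. The keyword lists in `GROUP_KEYWORDS` are hypothetical placeholders, not the ones used by the web page; the real ones would need to be copied from the page's JavaScript.

```python
import re

# Hypothetical keyword lists for illustration only; the actual keywords
# for each group are defined in the JavaScript of the rendered table page.
GROUP_KEYWORDS = {
    "Sea Ice": ["sea_ice"],
    "Atmospheric Chemistry": ["aerosol", "mole_fraction"],
}


def names_in_group(standard_names, group):
    """Return the standard names that contain any keyword of the given group."""
    pattern = re.compile("|".join(re.escape(k) for k in GROUP_KEYWORDS[group]))
    return [name for name in standard_names if pattern.search(name)]


sample_names = [
    "sea_ice_thickness",
    "air_temperature",
    "sea_ice_area_fraction",
    "mole_fraction_of_ozone_in_air",
]
sea_ice_names = names_in_group(sample_names, "Sea Ice")
```

Because the grouping is only a text search, newly added standard names fall into a group automatically as long as they contain one of its keywords.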

sadielbartholomew commented 1 year ago

Hi @lhmarsden, if you are willing to 'pull' them from the canonical source for the tables (the directories organised by table number under https://github.com/cf-convention/cf-convention.github.io/tree/main/Data/cf-standard-names) rather than from the rendered site, which would be more robust and in line with your desire here:

ideally I would like to build something more future-proof, accounting for any additions to the standard names list and also any possible reformatting of the above page.

then I already have some Python code that gets all of the names from any (or all) version(s) of the table and outputs them as a dictionary. It uses regular expressions to parse the XML, which might not be the simplest way, but it works a charm and is quick and robust, so good enough. I wrote those functions so that I could use their output to create the plots describing the totals and nature of the standard name sets, as described in the issue here: cf-convention/cf-convention.github.io#110, but I realise there could be wider use for the code.

I had that code on a personal git branch but have since moved it to tidy it up, so the current working code is not available for me to share yet, but I can put it up somewhere shortly if this is the kind of thing you are looking for?

I should add, my code presently doesn't pull in the further information such as:

including descriptions, units and the grouping

but it can be trivially adapted to include this information too. If you would like, and can give me a few days to find the time, I can make those adaptations so that the code I share includes them.
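As an illustration of the regular-expression approach described above, extended to units and descriptions, here is a hedged sketch. It is not the code referred to in this comment; the function name `parse_standard_names` is hypothetical, and it assumes each name is stored as an `<entry id="...">` element with `<canonical_units>` and `<description>` children.

```python
import re

# Assumed layout: <entry id="name"> ... </entry> with child elements
# <canonical_units> and <description>.
ENTRY_RE = re.compile(r'<entry id="([^"]+)">(.*?)</entry>', re.DOTALL)
FIELD_RE = re.compile(r"<(canonical_units|description)>(.*?)</\1>", re.DOTALL)


def parse_standard_names(xml_text):
    """Map each standard name to a dict of its units and description."""
    table = {}
    for name, body in ENTRY_RE.findall(xml_text):
        table[name] = {tag: value.strip() for tag, value in FIELD_RE.findall(body)}
    return table


sample_xml = """<standard_name_table>
  <version_number>79</version_number>
  <entry id="air_temperature">
    <canonical_units>K</canonical_units>
    <description>Air temperature is the bulk temperature of the air.</description>
  </entry>
  <entry id="sea_ice_thickness">
    <canonical_units>m</canonical_units>
    <description></description>
  </entry>
</standard_name_table>"""

names = parse_standard_names(sample_xml)
```

Note that a regular expression leaves XML character entities in the descriptions as they are; passing the values through `html.unescape` would be one way to decode them.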

lhmarsden commented 1 year ago

Thanks all for your interesting and helpful replies!

I think I will go with @MathewBiddle's suggestion and pull the data from the XML, and then group the terms using a text search. I hope that over short to medium time scales this will be suitable, and the approach is very simple, so it will presumably be easy to adapt any code as necessary in the future.

Thanks @sadielbartholomew, I see that your solution is indeed more future-proof, but I will stick with the simpler approach in this case. And thanks for your generous offer of help.

DocOtak commented 1 year ago

@lhmarsden In the hope that it might be useful, here is some code I use to load the XML table into an SQLite database. The link points to just the XML-reading part: https://github.com/cchdo/params/blob/ce69f81afdc92e2128494198539362549d4f2880/cchdo/params/__main__.py#L26-L60

It includes a check that the standard name table being loaded is the version it expects; that check could be removed.
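For readers who do not want to follow the link, the general idea can be sketched with the standard library alone. This is not the linked code: the function `load_into_sqlite`, the table name `standard_names` and its columns are all hypothetical, and the optional version check only mirrors the behaviour described above.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_into_sqlite(xml_text, conn, expected_version=None):
    """Load <entry> elements into SQLite; optionally check the table version."""
    root = ET.fromstring(xml_text)
    version = int(root.findtext("version_number"))
    if expected_version is not None and version != expected_version:
        raise ValueError(f"expected table version {expected_version}, got {version}")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS standard_names "
        "(name TEXT PRIMARY KEY, canonical_units TEXT, description TEXT)"
    )
    rows = [
        (entry.get("id"), entry.findtext("canonical_units"), entry.findtext("description"))
        for entry in root.findall("entry")
    ]
    conn.executemany("INSERT OR REPLACE INTO standard_names VALUES (?, ?, ?)", rows)
    conn.commit()
    return version


SAMPLE_XML = """<standard_name_table>
  <version_number>79</version_number>
  <entry id="air_temperature">
    <canonical_units>K</canonical_units>
    <description>Air temperature is the bulk temperature of the air.</description>
  </entry>
  <entry id="sea_ice_thickness">
    <canonical_units>m</canonical_units>
    <description></description>
  </entry>
</standard_name_table>"""

conn = sqlite3.connect(":memory:")
loaded_version = load_into_sqlite(SAMPLE_XML, conn, expected_version=79)
units = conn.execute(
    "SELECT canonical_units FROM standard_names WHERE name = ?", ("air_temperature",)
).fetchone()[0]
```

Leaving `expected_version` as `None` skips the check, which suits code that should always accept the newest table.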

MathewBiddle commented 1 year ago

@lhmarsden you can adjust the url in that code to point to the current xml document hosted on GitHub (which I just learned about through this conversation, so thank you for presenting this opportunity to learn something new):

https://github.com/cf-convention/cf-convention.github.io/raw/main/Data/cf-standard-names/current/src/cf-standard-name-table.xml

That way you pull over the most recent table every time you run the code.

lhmarsden commented 1 year ago

Very useful, thanks all

lhmarsden commented 1 year ago

I think you can also use this, which is a slightly neater URL:

https://cfconventions.org/Data/cf-standard-names/current/src/cf-standard-name-table.xml
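When always pulling the current table, it may help to record which version was actually retrieved. The sketch below assumes the XML root has `<version_number>` and `<last_modified>` children alongside the `<entry>` and `<alias>` elements; the helper names `fetch_xml` and `table_summary` are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

CURRENT_URL = (
    "https://cfconventions.org/Data/cf-standard-names/"
    "current/src/cf-standard-name-table.xml"
)


def fetch_xml(url=CURRENT_URL, timeout=30):
    """Download the XML table and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def table_summary(xml_bytes):
    """Report the version, modification date and size of a downloaded table."""
    root = ET.fromstring(xml_bytes)
    return {
        "version": root.findtext("version_number"),
        "last_modified": root.findtext("last_modified"),
        "n_entries": len(root.findall("entry")),
        "n_aliases": len(root.findall("alias")),
    }


# Usage (performs a network request):
# print(table_summary(fetch_xml()))
```

Storing the reported version next to any generated template makes it possible to tell later which table the template was built from.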

JonathanGregory commented 1 year ago

This question has been answered, so I'm closing this issue. I have opened website issue 408 to propose that we provide a link to https://cfconventions.org/Data/cf-standard-names/current/src/cf-standard-name-table.xml, the URL suggested by Luke @lhmarsden. Thanks, Luke.

cf-convention / vocabularies

Pull standard names into Python #144