Closed DamienIrving closed 9 years ago
I've updated my script, which creates a list of all files available in unofficial and authoritative: /g/data1/ua6/unofficial-ESG-replica/tmp/tree/esg-tree-LATEST-paths.txt. It now also creates, at the same time, a list of model names: /g/data1/ua6/unofficial-ESG-replica/tmp/tree/model-list-LATEST.txt. These two are actually links to the latest available lists, so the names remain the same.
Hello @paolap - thanks, those files look really good.
I should speak to @DamienIrving about it, but do you have a vision of how this information could be integrated into the workflow tool? The interface limitations (and my lack of Qt skills!) mean that doing data discovery through the workflow tool is quite hard - it is difficult to get information back to the user. I can see that having access to which files are present could be very useful though.
I will try adding a module that summarises an input `DataSet` and prints the `Constraint` values - I will let you know when this is done.
Also, NCI are in the process of building a database of downloaded data that they hold, and I think the plan is to put a searchable web front-end on it. When that gets going it would complement the system well.
Hi Tim, I'm not sure how to; I don't know CWSlab well, so I'll have a look. I'm pleased to hear that NCI is setting up a database. One thing I'm trying right now is to make the same path list available as a Python sqlite database; this should be relatively straightforward, with a small modification of my script. The obstacle I can foresee is how sqlite would handle such a big table, so I'll probably have to split it by experiment and so on. Perhaps that format would be easier to integrate? Another question I have is how often drstree is updated. When I add data to tmp/tree, what's the time lag before it appears in drstree?
Hi Paola - I've just added a workflow called dataset_summary to the workflows repository that should meet this need. (It requires the `devel` version of the plugin to run.) If you get a chance, check it out.
If you enter the required parameters into the `CMIP5 Constraints` field (such as `variable`, `experiment`, `institute` etc.) it will print a summary to the screen of what values these parameters take (what `tos` files are there for `rcp26` etc.), as well as a list of all the files in the `drstree` file system that match these parameters.
There are a few limitations to this though - it can't search on any metadata, it only works by pattern matching on the file system. The new NCI database will be fully searchable on metadata. It also only reads the `drstree` path.
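To make the pattern-matching idea concrete, here is a minimal sketch of how a list of paths could be filtered by constraints and summarised. It is not the actual dataset_summary code: the directory layout in `FIELDS`, the function names `parse_path` and `summarise`, and the example paths are all assumptions made for illustration, and the real drstree layout may well differ.

```python
from collections import defaultdict

# ASSUMPTION: a hypothetical directory layout, loosely modelled on the CMIP5 DRS.
# The real drstree layout may order or name these levels differently.
FIELDS = ("institute", "model", "experiment", "frequency", "realm",
          "mip_table", "ensemble", "version", "variable")


def parse_path(path):
    """Map the trailing directory levels of a file path onto FIELDS.

    Returns None when the path is too short to hold every field.
    """
    parts = path.strip("/").split("/")
    if len(parts) < len(FIELDS) + 1:
        return None
    # The last component is the file name; the levels before it are the fields.
    values = parts[-(len(FIELDS) + 1):-1]
    return dict(zip(FIELDS, values))


def summarise(paths, **constraints):
    """Return (summary, matched) for the paths that satisfy every constraint.

    summary maps each field to the sorted values seen among matching paths;
    matched is the list of matching paths in their original order.
    """
    seen = defaultdict(set)
    matched = []
    for path in paths:
        record = parse_path(path)
        if record is None:
            continue
        if all(record.get(key) == value for key, value in constraints.items()):
            matched.append(path)
            for key, value in record.items():
                seen[key].add(value)
    return {key: sorted(values) for key, values in seen.items()}, matched


# Made-up example paths, not real files.
example_paths = [
    "/root/INST-A/MODEL-A/rcp26/mon/ocean/Omon/r1i1p1/v1/tos/tos_a.nc",
    "/root/INST-B/MODEL-B/rcp85/mon/ocean/Omon/r1i1p1/v1/tos/tos_b.nc",
]
summary, matched = summarise(example_paths, variable="tos", experiment="rcp26")
```

Because this only looks at directory names, it shares the limitation described above: anything stored in the file metadata rather than in the path is invisible to it.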
As for how often the drstree gets updated, I am a bit out of the loop with that. I think that this is being administered by @taerwin, so it could be worth talking to him about it. I seem to remember that the updates were scheduled through a crontab, but I am not certain.
Hi Tim,
I had a go at it and it works. If you're interested, I've modified one of my scripts so it produces a sqlite database instead of a csv file. It still uses the file list in /tmp/tree as input and you can choose various constraints; it stores the result in a database with one table called "cmip5" and the following fields: id (the "ensemble" path on raijin, which is unique and works as the index), variable, model, experiment, mip-table, ensemble, version. One row is one ensemble rather than one file, to keep it more manageable. An example which I ran without constraints, so it lists all of authoritative+replica, is in .../tmp/tree/cmip5_raijin_latest.db (this is actually a link to cmip5_2015-06-09.db in the same dir, just to make it easier to update).
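For anyone who wants to picture the table, below is a rough sketch of what a one-table layout like this could look like with Python's built-in `sqlite3` module. It is not the actual script: the helper names `build_db` and `find_ensembles`, the column name `mip_table` (an underscore is used here because a hyphen would need quoting in SQL), and the example rows are all assumptions for illustration.

```python
import sqlite3

# ASSUMPTION: column names follow the fields listed above, with "mip-table"
# written as mip_table so it does not need quoting in SQL.
COLUMNS = ("id", "variable", "model", "experiment",
           "mip_table", "ensemble", "version")


def build_db(rows, db_path=":memory:"):
    """Create the cmip5 table (one row per ensemble) and insert rows."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cmip5 ("
        "id TEXT PRIMARY KEY, variable TEXT, model TEXT, experiment TEXT, "
        "mip_table TEXT, ensemble TEXT, version TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO cmip5 VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


def find_ensembles(conn, **constraints):
    """Return the ids (ensemble paths) matching every given constraint."""
    for key in constraints:
        # Only known column names are ever interpolated into the SQL text.
        if key not in COLUMNS:
            raise ValueError(f"unknown field: {key}")
    query = "SELECT id FROM cmip5"
    if constraints:
        query += " WHERE " + " AND ".join(f"{key} = ?" for key in constraints)
    query += " ORDER BY id"
    cursor = conn.execute(query, tuple(constraints.values()))
    return [row[0] for row in cursor]


# Made-up rows; the id values are placeholders, not real raijin paths.
example_rows = [
    ("/hypothetical/a", "tos", "MODEL-A", "rcp26", "Omon", "r1i1p1", "v1"),
    ("/hypothetical/b", "tos", "MODEL-B", "rcp85", "Omon", "r1i1p1", "v1"),
    ("/hypothetical/c", "tas", "MODEL-A", "rcp26", "Amon", "r1i1p1", "v1"),
]
conn = build_db(example_rows)
tos_rcp26 = find_ensembles(conn, variable="tos", experiment="rcp26")
```

Keeping one row per ensemble rather than per file keeps the table small, and the primary key on id gives sqlite an index on the path for free.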
Hi Paola - I'm glad that's working. I would be interested in having a look at your script with sqlite database. Is it publicly visible anywhere? Feel free to email me at the Bureau, or perhaps a private message on the gitter board.
@paolap has a Python script for listing all the models/data that are available on NCI. It would be nice to implement that so people can know what to enter into the `Constraint Builder`.