CWSL / cwsl-mas

VisTrails plugin for Climate Model Analysis
Apache License 2.0

Sanity checker #42

Open DamienIrving opened 9 years ago

DamienIrving commented 9 years ago

We need a sanity checker to check for problems with the data.

In a workflow you would go Constraint Builder -> CMIP5 -> Sanity Check.

DamienIrving commented 9 years ago

@didiermonselesan and @willrhobbs in Tasmania pretty much know all the sanity checking that would need to be done. They are going to list the types of things that need checking below...

captainceramic commented 9 years ago

Also related to this (and a wider issue than just the plugin) is to work out what to do when the sanity checker fails - how do we deal with 'faulty' data in the archive at NCI?

didiermonselesan commented 9 years ago

Tim,

Agreed. The first thing would be to log the failures and the reasons why the workflow fails on particular files. One of the first steps for CMIP, or any data files of the same kind, would be to check individual files (as produced by the modelling group) to make sure they comply with the a priori agreed 'conventions' regarding, for example:

1) naming of the file
2) naming of the file together with metadata (especially inheritance, i.e. parent experiment, time correspondence, etc.)
3) the data itself: correct missing/fill values, data range given the unit and variable, monotonicity and gaps given the time frequency and bounds, missing coordinates and coordinate bounds

Then possibly moving to more ‘subjective' checks

4) Physically relevant checks (e.g. daily and seasonal cycles)

Note that some packages do perform conventions checking, either for single files or when applied across files.
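As a rough illustration of checks 1) and 2), here is a minimal Python sketch that parses a file name and compares it with metadata. The regular expression is an assumption modelled on the CMIP5 naming convention (`<variable>_<MIP table>_<model>_<experiment>_<ensemble>[_<start>-<end>].nc`), and `check_filename` and the `metadata` dictionary are hypothetical names, not part of the plugin. The caller is assumed to have read the metadata values from the file's global attributes.

```python
import re

# Assumed pattern, modelled on the CMIP5 naming convention:
# <variable>_<MIP table>_<model>_<experiment>_<ensemble>[_<start>-<end>].nc
CMIP5_NAME = re.compile(
    r"^(?P<variable>[A-Za-z0-9]+)_"
    r"(?P<table>[A-Za-z0-9]+)_"
    r"(?P<model>[A-Za-z0-9-]+)_"
    r"(?P<experiment>[A-Za-z0-9-]+)_"
    r"(?P<ensemble>r\d+i\d+p\d+)"
    r"(?:_(?P<start>\d+)-(?P<end>\d+))?"
    r"\.nc$"
)


def check_filename(filename, metadata=None):
    """Return a list of problem descriptions (empty if nothing was found).

    `metadata` is a hypothetical dict such as {"model": ..., "experiment": ...}
    holding values the caller has already read from the file itself.
    """
    match = CMIP5_NAME.match(filename)
    if match is None:
        return ["filename does not match the assumed pattern"]

    parts = match.groupdict()
    problems = []

    # The optional time range should not run backwards.
    if parts["start"] is not None and int(parts["start"]) > int(parts["end"]):
        problems.append("time range in filename runs backwards")

    # Each piece of metadata supplied should agree with the file name.
    for key, expected in (metadata or {}).items():
        if parts.get(key) != expected:
            problems.append(
                "%s: filename has %r, metadata has %r"
                % (key, parts.get(key), expected)
            )
    return problems
```

Returning a list of problems rather than raising an exception keeps the emphasis on logging failures and their reasons.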

Cheers, Didier

willrhobbs commented 9 years ago

My feeling is that rather than using the 'sanity checker' to handle exceptions, it's probably best just to flag issues that can then be investigated. I also agree that, as a first pass, checking that the metadata on each file is complete and accurate would be a good start (we would hope that PCMDI would do this, but that's unlikely).

Physical-consistency checks would also be helpful, but we'd need to consider each variable separately. Some variables lend themselves to 'hard limits', e.g. salinity and precip can never be less than zero (although they are in some CMIP5 models!); others are less obvious (what should the maximum allowable temperature be?)
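A 'hard limit' test of this kind could be sketched in Python as below. The `HARD_LIMITS` table and the `flag_out_of_range` function are hypothetical, and the entries use the CMIP5 short names `pr` (precipitation) and `so` (salinity) purely for illustration. Real limits would have to be agreed per variable and unit.

```python
# Hypothetical limits table: (lower, upper), with None meaning "no limit".
HARD_LIMITS = {
    "pr": (0.0, None),  # precipitation can never be less than zero
    "so": (0.0, None),  # salinity can never be less than zero
}


def flag_out_of_range(variable, values, fill_value=None):
    """Return (index, value) pairs that break the variable's hard limits.

    Fill values are skipped. Variables without an entry in HARD_LIMITS
    are never flagged.
    """
    lower, upper = HARD_LIMITS.get(variable, (None, None))
    flagged = []
    for index, value in enumerate(values):
        if fill_value is not None and value == fill_value:
            continue
        if lower is not None and value < lower:
            flagged.append((index, value))
        elif upper is not None and value > upper:
            flagged.append((index, value))
    return flagged
```

The function only flags suspicious values so that they can be investigated. It does not alter the data.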

I would add a test of the time coords on all merged lists of variables across a range of files. Time arrays should be consistent across all files within the same experiment, monotonic, have no gaps, and no 'overlaps'. In my experience these types of errors have been the biggest and most persistent headache.
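A minimal Python sketch of such a time-axis test follows, assuming the time values from all files in an experiment have already been concatenated, in file order, into one list of numbers in a common unit. `check_time_axis` and `max_step` are hypothetical names. `max_step` is the largest spacing considered normal (for monthly data in days, for example, it would need to allow for the longest month).

```python
def check_time_axis(times, max_step, tolerance=1e-6):
    """Flag non-monotonic steps, overlaps and gaps in a merged time axis.

    Returns a list of (index, description) pairs, where index is the
    position of the second time value in the offending pair.
    """
    problems = []
    for i in range(1, len(times)):
        step = times[i] - times[i - 1]
        if step <= 0:
            # Time did not advance: a repeated or overlapping segment.
            problems.append((i, "overlap or non-monotonic step"))
        elif step > max_step + tolerance:
            # Time jumped further than the expected spacing allows.
            problems.append((i, "gap"))
    return problems
```

For example, `check_time_axis([0, 1, 1, 2], max_step=1)` flags index 2, where a time value is repeated.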

There are some simple 'experiment' tests based on global energy balance that both Didier and I have found can highlight some sneaky issues; these would not be applied to all files of course, but are a useful model diagnostic.

Will

captainceramic commented 9 years ago

In order to get this into the tool it would have to be implemented in a script that takes in the name of the file to check on the command line. Something like:

./sanity_checker tas_1981_..._ACCESS1.0.nc

etc.
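A command-line entry point along these lines might look like the Python sketch below. It is only an outline under stated assumptions: `run_checks` is a hypothetical placeholder with two trivial checks standing in for the real ones, and the exit-code convention (0 if no problems were flagged, 1 otherwise) is a suggestion.

```python
import argparse
import logging
import os

logger = logging.getLogger("sanity_checker")


def run_checks(path):
    """Hypothetical placeholder: return a list of problems found in one file."""
    problems = []
    if not os.path.isfile(path):
        problems.append("file does not exist")
    if not path.endswith(".nc"):
        problems.append("file does not have a .nc extension")
    return problems


def main(argv=None):
    parser = argparse.ArgumentParser(description="Flag problems in data files.")
    parser.add_argument("files", nargs="+", help="names of the files to check")
    args = parser.parse_args(argv)

    failed = False
    for path in args.files:
        for problem in run_checks(path):
            # Log every failure with its reason instead of stopping early.
            logger.warning("%s: %s", path, problem)
            failed = True
    return 1 if failed else 0

# A real script would finish by calling sys.exit(main()) when run directly.
```

Logging each problem and carrying on means one run reports everything found across all the files given.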

We could have it integrated with every workflow, but I think it could be a better approach to have a specific workflow just for data checking.

Another issue is what to do when problems are found - should they be fixed in the downloaded data? Should the files be deleted from the archive? Do the modelling groups need to be contacted this far after release? I know that people at Aspendale have been working on this problem as part of the data archive download.

I am getting in touch with NCI to seek their input on this issue.

DamienIrving commented 9 years ago

During the CWSLab phone meeting last week @taerwin suggested that the checking/cleaning of CMIP5 data could be handled by NCI. Instead of pointing at the raw CMIP5 data directory, the CWSLab workflow tool could be pointed at a directory of data that has passed a quality check. If users found an error in the quality-checked data, they could report it to NCI and that data could be removed from the directory until it was fixed (and an appropriate test could be added to the quality check). NCI would report errors back to PCMDI if necessary.

NCI would obviously need assistance from people like @didiermonselesan and @willrhobbs in defining the checks that should be included in the quality checking procedure. For the sake of transparency it would be good if the code for the tests were hosted on a public GitHub repo (possibly in the CWSL repo).

cet900 commented 9 years ago

Late to the party, sorry. I'm watching this space now so you can contact me here, but email's probably still better for me having a searchable record :) I think you all have my address. Anyway, if/whenever you need NCI input, just give me a yell. We've got some errata tracking pages, similar to the ACCESS and CMIP5 official structures, that will be open for all NCI users to edit once our Confluence goes live beyond NCI staff. In the meantime, just keep reporting anything you find to me directly. Sorry for the inefficiency.... Will be good to see a sanity checker tool developed! :)