ckan / ideas

[DEPRECATED] Use the main CKAN repo Discussions instead:
https://github.com/ckan/ckan/discussions

More specific dataset types. #135

Open rossjones opened 9 years ago

rossjones commented 9 years ago

As a user, I find that when I am uploading datasets on a quarterly basis, I often have to create a new dataset just for that quarter if I have too many files for a single dataset - one dataset with 100 resources (say 25 for each quarter) is hard to work with in the UI. This also applies to datasets that are updated monthly or, worse, weekly. However, I'd still like to keep a year's worth of data in a single dataset.

Should CKAN provide specific types of datasets out of the box, but all available at the same /datasets url?

Specifically, could/should CKAN provide a QuarterlyDataset, where the user has to specify which quarter each resource belongs to, and/or a MonthlyDataset, where the user has to specify which month each resource belongs to? This would make it much easier, at least at the UI level, to group resources together so that they are easily manageable. I am envisioning a UI where the four quarters of the year are directly addressable and also grouped, so that users can go straight to the quarter they are interested in.

This would also potentially simplify reporting on when particular datasets are 'late', and other reports that depend on datasets being updated regularly.

marks commented 9 years ago

Just curious - why not make the time-scoped datasets resources of a main dataset? I'm just seeing if there's an easy win here that requires no code change, only a methodology/governance change.

Mark

rossjones commented 9 years ago

This is how the old resource_groups used to work. Although it looked like resources belonged to datasets, they didn't: they belonged to a resource_group, and that belonged to the package. This made it unwieldy when working with resources, and CKAN core assumed there was only ever one resource_group. It was me that removed them, so now I feel bad :(

As of CKAN 2.3, resources reference (via package_id) the package they belong to, and the preferred approach is to 'group' resources using extras. Performance-wise, this is much better than it used to be. But we have potentially lost something: the (unimplemented) ability to group resources together under some heading or other.
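A minimal sketch of what grouping by extras could look like on the client side. It assumes the extra shows up as a top-level key on each resource dict; the key name `quarter` and the sample resources are made up for illustration and are not part of any CKAN schema.

```python
from collections import defaultdict


def group_resources(resources, key="quarter", default="ungrouped"):
    """Group resource dicts by the value of one extra field.

    `key` is a hypothetical extra; resources without it land in `default`.
    """
    groups = defaultdict(list)
    for res in resources:
        groups[res.get(key, default)].append(res)
    return dict(groups)


# Illustrative data only: package_id links each resource to its dataset.
resources = [
    {"name": "spend-q1.csv", "package_id": "spend-2015", "quarter": "Q1"},
    {"name": "spend-q1-notes.pdf", "package_id": "spend-2015", "quarter": "Q1"},
    {"name": "spend-q2.csv", "package_id": "spend-2015", "quarter": "Q2"},
    {"name": "readme.txt", "package_id": "spend-2015"},
]

grouped = group_resources(resources)
for heading in sorted(grouped):
    print(heading, [r["name"] for r in grouped[heading]])
```

The grouping only exists in whatever code reads the extras, which is the trade-off described above: nothing in the model itself knows about the headings.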

Your suggestion is a good alternative to what I want, I suppose. There are lots of changes we could be making to the model though, and I'm not sure one dataset with many resources is enough - but maybe I am overthinking it. We could certainly do with exposing the PackageRelationship models via the UI though. This would at least make grouping datasets more effective (i.e. I could just release one dataset per quarter and group them with a relationship).

wardi commented 9 years ago

I'd like to have a good solution for serial releases too.

A different hack would be to use a datastore table to store all the serial release metadata. E.g. you can have a single "weekly" resource with a datastore table like:

year  week  released    url
2015  11    2015-03-01  http://example...
2015  12    2015-03-08  http://example...

That table could grow much larger than the number of resources a dataset can comfortably hold, and you can search on the ranges you're interested in with the existing datastore search.

This would also let you have monthly/quarterly releases on the same dataset, and your dataset metadata won't grow in an unbounded way.
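A rough sketch of how such a table could be queried. It builds the JSON body for a `datastore_search` action call (normally sent to `/api/3/action/datastore_search`) without actually sending it; the resource id is a placeholder, and the column names are taken from the example table above. As far as I know `filters` matches exact values, so the week range here is narrowed on the client; `datastore_search_sql` would be another option.

```python
import json


def weekly_release_query(resource_id, year, limit=100):
    """Build a datastore_search request body for one year of releases."""
    return {
        "resource_id": resource_id,   # placeholder id, not a real resource
        "filters": {"year": year},    # exact-match filter on the year column
        "sort": "week asc",
        "limit": limit,
    }


def weeks_in_range(records, first_week, last_week):
    """Keep only the records whose week falls in the inclusive range."""
    return [r for r in records if first_week <= r["week"] <= last_week]


payload = weekly_release_query("hypothetical-resource-id", 2015)
print(json.dumps(payload, indent=2))

# Records shaped like the rows in the example table.
records = [
    {"year": 2015, "week": 11, "released": "2015-03-01", "url": "http://example..."},
    {"year": 2015, "week": 12, "released": "2015-03-08", "url": "http://example..."},
]
print(weeks_in_range(records, 12, 20))
```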

rossjones commented 8 years ago

We currently do this on DGU by breaking up resources into three types based on the type field. For timeseries resources we then group, but only by year, after forcing the user to specify dates for each resource.

I think it would be useful to have a temporal field in the default resource schema (which could be a date or a range) so that we can do the grouping automatically if the data is present. Currently we rely on people adding this field as extras. Knowing which period a resource covers is something that crops up from time to time, and having the temporal field only on the dataset isn't that useful.
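A small sketch of the automatic grouping this would allow, assuming each resource carried an ISO date in a field called `temporal_start`. That field name is hypothetical (no such field exists in the default schema today), and the sample resources are made up.

```python
from collections import defaultdict
from datetime import date


def quarter_label(d):
    """Return a label like '2015-Q1' for a date."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def group_by_quarter(resources, field="temporal_start"):
    """Group resource names by the quarter of a (hypothetical) date field.

    Resources without the field are collected under 'undated'.
    """
    groups = defaultdict(list)
    for res in resources:
        raw = res.get(field)
        label = quarter_label(date.fromisoformat(raw)) if raw else "undated"
        groups[label].append(res["name"])
    return dict(groups)


resources = [
    {"name": "week-11.csv", "temporal_start": "2015-03-01"},
    {"name": "week-12.csv", "temporal_start": "2015-03-08"},
    {"name": "week-16.csv", "temporal_start": "2015-04-15"},
    {"name": "notes.txt"},
]

by_quarter = group_by_quarter(resources)
for label in sorted(by_quarter):
    print(label, by_quarter[label])
```

The same label function could group by year or month instead; the point is only that the grouping falls out of the data once the date is a first-class field rather than an optional extra.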