HHS / ckanext-datajson

Custom CKAN extension for Healthdata.gov
http://hub.healthdata.gov/data.json

Publisher Field #6

Open seanherron opened 11 years ago

seanherron commented 11 years ago

I'm working on a modification to the extension to parse out data.json files by the organization they belong to in CKAN. One question I have is with the implementation of the publisher field - why does it map to author in CKAN rather than to organization? Was going to change this around but wanted to check on the rationale behind it first. Thanks!

JoshData commented 11 years ago

Hi, Sean.

A few reasons. The main one to be wary of is that organizations are permissions structures in CKAN. I don't think it would be appropriate to map harvested datasets to organizations based on the publisher field. You risk giving write-permission to something that shouldn't be edited, or losing permission to update it later. If anything, they should all map to a single Harvester organization.

Groups might be more appropriate.

But also, publisher is a string field. Mapping to organizations/groups may be complex and the logic may depend on whose catalog it is. Also, managing the creation/updating/deletion of organizations/groups is a lot more work that I didn't want to get into.

Am definitely not opposed to seeing a way to map datasets to groups though. That'd be very handy.
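To make the "publisher is a string field" point concrete, here is a minimal sketch of turning a free-text publisher into a group-style identifier. The function name `publisher_to_group_name` and the `unknown-publisher` fallback are hypothetical, and the assumed naming rule (lowercase ASCII letters, digits, `-` and `_`, bounded length) should be checked against the CKAN version in use. It does not create or update any groups, which is the harder part.

```python
import re


def publisher_to_group_name(publisher: str, max_len: int = 100) -> str:
    """Derive a slug-like group name from a free-text publisher string.

    Hypothetical helper: the allowed character set and length limit are
    assumptions about CKAN naming rules, not taken from the extension.
    """
    slug = publisher.strip().lower()
    # Collapse every run of disallowed characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9_-]+", "-", slug)
    # Squeeze repeated hyphens and trim them from both ends.
    slug = re.sub(r"-{2,}", "-", slug).strip("-")
    # Truncate, then trim again in case the cut landed on a hyphen.
    slug = slug[:max_len].strip("-")
    return slug or "unknown-publisher"


if __name__ == "__main__":
    print(publisher_to_group_name("U.S. Department of Health & Human Services"))
```

Note that two differently spelled publishers still produce two different slugs, so this alone does not solve the identity problem.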

dwcaraway commented 11 years ago

"You risk giving write-permission to something that shouldn't be edited, or losing permission to update it later. If anything, they should all map to a single Harvester organization."

I don't understand. Do the permissions that are gained or lost concern editing metadata?

Our CKAN system will use data.json to populate the catalog. We are also offering the ability for users to log in to enter their metadata, upload files, etc. The metadata is used to produce data.json files for the organization.

"But also, publisher is a string field. Mapping to organizations/groups may be complex and the logic may depend on whose catalog it is. Also, managing the creation/updating/deletion of organizations/groups is a lot more work that I didn't want to get into."

Yes, mapping may be inherently complex. We'll likely have to use some machine learning if we start seeing significant variations in the entered data. For right now, though, we'll hope that we won't have too many endpoints to harvest and so can establish standards and procedures to minimize the technical problem of establishing identity.

For us, organizations/groups are great as they're already baked into CKAN.
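Short of machine learning, a small alias table with fuzzy fallback may be enough while the number of endpoints is low. The following is only a sketch under that assumption: `resolve_publisher`, the `known` mapping, and the `0.85` cutoff are all hypothetical, and it uses the standard library's `difflib` rather than anything from CKAN.

```python
import difflib
import re
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def resolve_publisher(raw: str, known: Dict[str, str],
                      cutoff: float = 0.85) -> Optional[str]:
    """Map a raw publisher string to a known identifier, or None.

    `known` maps already-normalized publisher names to identifiers
    (hypothetical data maintained by the catalog operator).
    """
    key = normalize(raw)
    if key in known:
        return known[key]
    # Fall back to the closest known name above the similarity cutoff.
    matches = difflib.get_close_matches(key, list(known), n=1, cutoff=cutoff)
    return known[matches[0]] if matches else None


if __name__ == "__main__":
    known = {"department of health and human services": "hhs"}
    print(resolve_publisher("Department of Health and Human Servces", known))
```

Unmatched publishers return `None`, so they can be queued for manual review instead of being assigned to the wrong organization.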

Given this information, are there other reasons not to use publisher?

JoshData commented 11 years ago

Well, like I said, I don't think orgs make sense. Groups make sense.

But if you guys submit a patch to do either, I'd be glad to merge it.

JoshData commented 11 years ago

Oh, see #5 though --- I merged Fuhu Xia's patch assigning datasets to the org that owns the harvester source.
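For readers following the thread, the arrangement described here (publisher kept as the author string, ownership taken from the harvest source) could look roughly like the sketch below. It is not the extension's actual code: `datajson_to_package` and its `source_org_id` parameter are hypothetical, and the field selection assumes a data.json entry whose `publisher` is a plain string.

```python
def datajson_to_package(dataset: dict, source_org_id: str) -> dict:
    """Build a CKAN-style package dict from one data.json dataset entry.

    Illustrative only: the publisher stays a free-text author value,
    while the owning organization comes from the harvest source rather
    than from anything in the dataset itself.
    """
    return {
        "title": dataset.get("title", ""),
        "notes": dataset.get("description", ""),
        # Publisher is stored as text, not resolved to an organization.
        "author": dataset.get("publisher", ""),
        # Every dataset from this source gets the same owning organization.
        "owner_org": source_org_id,
    }


if __name__ == "__main__":
    entry = {"title": "Example", "publisher": "Example Agency"}
    print(datajson_to_package(entry, "harvest-source-org"))
```

Keeping ownership tied to the harvest source means the publisher value never affects who can edit a harvested dataset.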