Several minor comments

crowston commented 9 years ago

A-5 duplicates A-3.

Is C the community producing the data or the community the data are about? I guess the latter, but it could be clarified.

It might be useful to separate Ethics from IRB compliance. For the latter, using the definitions from the IRB for things like human subjects would be good. Of course, that would be only a US view; I don't know the other rules.

DS-2: what about datasets that aren't published?

Minor typo: D3-6 should be DS-6.

DS-6 suggests the need for a taxonomy of processing levels like http://uregina.ca/piwowarj/Think/ProcessingLevels.html

Optional DS-4 and DS-5 reuse the numbers.

Optional DS-5 is the start of a bigger category of Provenance. That could even include pointers to the scripts used, the settings, etc.

libbyh commented 9 years ago

Fixed the duplicates, typos, and numbering issues.

The rest I leave here for people to comment on. I remember the group was hesitant to introduce any controlled vocabularies/ontologies/taxonomies, but that may have changed. Or, referring to one might give people a place to start developing our own?

AniKarenina commented 9 years ago

I'm in favor of keeping DS-4 and DS-5 optional. DS-4 ought to be trivial to provide, so hopefully it would be, and it is often displayed on the systems that would serve the files. It could also be filled in post hoc by a third party.

DS-5 seems too open to interpretation as shown. # of records/lines/entries makes the most sense to me because it's operationally useful (how many entries can Excel open these days?) and relevant without much translation. Your definition of users may not match mine, so beyond the level of records, it starts needing more interpretation that is harder to clarify.

DS-6a and DS-8 aren't adequately distinct to me. One might refer to selection/sampling/queries? I'm trying to think of how this applies in cases where data are obtained by 1) collecting data manually and/or selectively (e.g., by humans, not scraping whole pages), and 2) downloading the data set straight from the community. Replicating would require duplicating specific database queries in case 2 and several others, so that could go alongside "which Twitter API was used" for an example, if that's the kind of information that is intended to fit into that field (and if it doesn't, where should it go?)

I also agree that DS-5 and DS-6 could become a separate P(rovenance) category, and DS-7 seems tightly linked to that. A new version implies new provenance due to new selection/processing, right? DS-5 would partially describe the differences between versions and their provenance, so it would make sense to package them together.
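
To make the packaging idea concrete, here is a minimal sketch of a manifest where each version carries its own provenance block. Every field name here (`record_count`, `selection`, `processing_level`, `scripts`, `settings`, `versions`) and the `dedupe.py` script name are hypothetical illustrations, not taken from the spec, and the processing-level values are placeholders rather than an agreed taxonomy.

```python
import json

# All field names below are hypothetical; none of them come from the spec.
def make_provenance(record_count, selection, processing_level,
                    scripts=None, settings=None):
    """Bundle size, selection, and processing details into one block."""
    return {
        "record_count": record_count,          # DS-5 style size, in records
        "selection": selection,                # how the data were obtained
        "processing_level": processing_level,  # DS-6 style processing note
        "scripts": scripts or [],              # pointers to scripts used
        "settings": settings or {},            # settings the scripts ran with
    }

def add_version(manifest, version, provenance):
    """Return a copy of the manifest with one more version appended."""
    versions = list(manifest.get("versions", []))
    versions.append({"version": version, "provenance": provenance})
    return {**manifest, "versions": versions}

manifest = {"title": "example dataset", "versions": []}
manifest = add_version(
    manifest, "1.0",
    make_provenance(1000, "full download from the community", "raw"),
)
manifest = add_version(
    manifest, "1.1",
    make_provenance(950, "full download from the community",
                    "deduplicated", scripts=["dedupe.py"]),
)
print(json.dumps(manifest, indent=2))
```

In this sketch a new version cannot be added without a new provenance block, which matches the intuition that a new version implies new selection or processing.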

I like the notion of the DS-6 processing levels, but would like to see if we could develop our own versions of them. Preferably through empirical evidence as accumulated in MIOCS files.

sgoggins commented 8 years ago

I think this issue is now resolved by the work that has occurred since this meeting. The manifest is now in .json files.

I think this can be closed.

sgoggins commented 8 years ago

This issue has been at least superseded by the multiple manifest versions since Copenhagen.

OCDX / OCDX-Specification

Several minor comments #3