W3C-HCLSIG / HCLSDatasetDescriptions


Feedback from testing on naive user #72

Closed mellybelly closed 9 years ago

mellybelly commented 10 years ago

Here are some considerations based on a conversation with Anne Thessen and Chris Mungall in attempting to annotate the GLOBI dataset with the new description. Will send sample files.

We need a better way to attribute the roles that people play in the creation, maintenance/curation, or publication of a dataset, e.g. not just author, but also curator, contributor, processing/analysis, etc. As per #66, I think these could come from and/or be implemented in VIVO-ISF, which has a good representation of the relationships between people, organizations, and scholarly products. Or PAV might be fine and could be enhanced, but it should be aligned with other ongoing efforts to relate contributions to publications.

Contributors, curators, etc. can change from version to version. The summary level description could subsume all of these. Basically, if there are contributions of various types to any given version of a dataset, then these should be additive in the summary.

What is the relationship between a publication and a dataset? This is what cito:citesAsAuthority is trying to capture, but it could be more specific: is the paper about the dataset? Did it use the dataset? Did it produce the dataset? And how should this be used if the citation is not a traditional published article?

Not all datasets have a publisher and a license, so perhaps these should not be a "must"?

micheldumontier commented 10 years ago

Hi Melissa, thanks for these comments. Can you point to documentation that shows how to use VIVO-ISF for these relationships? wrt level assignment: we picked the version/distribution level for this information precisely for the reason you indicated, that this information is subject to change from version to version (or possibly even at the distribution level). Since we mandate the use of versions, certainly where there is a distribution, it makes sense to look there. But what if there is no version or distribution information for the dataset, i.e. it is a lightweight description altogether, of the kind that appears in a (legacy) registry of datasets? I tend to agree that it is perhaps too harsh to outlaw this annotation at the summary level. Perhaps @AlasdairGray has something to say about this.

All datasets should definitely have a publisher or organization that was responsible for producing it in the first place. However, in terms of license, we had an extensive debate about this in issue #65, where we agreed to

Summary Level: Change to a MAY
Version Level: Leave as a MUST
Distribution Level: Leave as a MUST
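The requirement levels above could be checked mechanically. The following is only a minimal sketch: it assumes the third level is the distribution level, represents each description as a plain Python dict rather than RDF, and uses made-up ids and a made-up license URI. The function name `missing_license` is hypothetical and not part of any HCLS tooling.

```python
# Requirement level for the license at each description level, as summarised above.
# The level names used as keys are this sketch's own labels.
LICENSE_REQUIREMENT = {
    "summary": "MAY",
    "version": "MUST",
    "distribution": "MUST",
}


def missing_license(descriptions):
    """Return the ids of descriptions that lack a license their level requires."""
    problems = []
    for desc in descriptions:
        required = LICENSE_REQUIREMENT[desc["level"]] == "MUST"
        if required and not desc.get("license"):
            problems.append(desc["id"])
    return problems


# Hypothetical descriptions; ids and the license URI are for illustration only.
descriptions = [
    {"id": ":dataset", "level": "summary"},
    {"id": ":datasetV1", "level": "version",
     "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"id": ":distribution1", "level": "distribution"},
]

print(missing_license(descriptions))
```

Here the summary level passes without a license, while the distribution level is reported because it has none.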

hope that helps.

KimJBaran commented 10 years ago

@mellybelly Have you sent the sample files yet?

micheldumontier commented 10 years ago

yes. she sent them to me, and I reviewed them. I only made a suggestion on using URIs for references (instead of strings), that way they can be linked to DOIs or whatever.

mellybelly commented 10 years ago

@micheldumontier do you want to post the corrected files? please feel free (or I can later.. but crazy travel at the moment).

Also, producers of data can be any agent, not necessarily an organization or publisher? Or is it a requirement to borrow the publisher/org from the agent in some way? So for example, if I post a dataset about the number of microbreweries opening over time and their proximity to a certain latitude and longitude, must I necessarily state that OHSU is the parent organization? Or the Oregon Homebrewers Assoc? Since this is largely focused on HCLS data it may not be much of an issue, but for datasets in general it might be a bigger one. Or could you use the site where it is posted, such as GitHub or Figshare, etc.? Probably just my own naiveté but maybe good for FAQs.

micheldumontier commented 10 years ago

@mellybelly I don't have the corrected versions...

i would think that producers of data can be any agent who claims responsibility for making it public.

mellybelly commented 10 years ago

ok will get them posted here as soon as we can. Thx.

AlasdairGray commented 10 years ago

On 29 Jul 2014, at 02:22, Michel Dumontier wrote:

wrt level assignment: we picked version/distribution level for this information precisely as you indicated - that these information are subject to change from version to version (or possibly even at the distribution level). since we mandate the use of versions, certainly where there is a distribution, then it makes sense to look there. but what if there is no version or distribution information for the dataset? that it is a lightweight description alltogether, and the kind which appears in a (legacy) registry of datasets. I tend to agree that perhaps it is too harsh to outlaw this annotation at the summary level. perhaps @AlasdairGray has something to say about this.

I disagree. I think it is essential to keep the distinction between the summary (abstract) notion of a dataset and the specific versions. So even these legacy datasets need to make two resources – the summary level and the version level descriptions. This enables the creation of future releases of the dataset without needing to change previously published data.

With regard to giving credit to individuals involved in the creation etc. of a dataset: first, we do not prevent the use of other vocabularies, so something like VIVO could be used to annotate version and distribution level descriptions. PAV is simply a suggestion here. We do, however, require that some less informative dcterms properties are provided to ensure interoperability. (Since no agreement could be reached on which vocabulary to mandate, it seemed most sensible to require a less specific property, so that consumers know that at least this property will be present.)

Finally, putting authors on the summary level is going to cause problems. First and foremost, it puts a maintenance burden on the summary level description, something we have sought to avoid. Note that this information can be captured through a query that returns the author information from each of the version level descriptions. Second, there is an interpretation issue: what is the meaning of the author property on the summary level as opposed to the version level? However, the same issue exists between the version level and the distribution level.

Perhaps an example will help focus the discussion.

Summary Level

:dataset a dcts:Dataset;
  dct:author :bill, :ben.

Version level for version 1

:datasetV1 a dcts:Dataset;
  dct:isVersionOf :dataset;
  dct:author :bill.

Version level for version 2

:datasetV2 a dcts:Dataset;
  dct:isVersionOf :dataset;
  dct:author :bill, :ben.

Consider the consequences of adding a third version of the dataset

Version level for version 3

:datasetV3 a dcts:Dataset;
  dct:isVersionOf :dataset;
  dct:author :alice.

Who is responsible for adding :alice to the summary level (file)?
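The query-based alternative mentioned earlier in this comment can be sketched as follows. This is an illustration only: the triples from the example are held as plain Python tuples rather than in an RDF store, the summary level carries no author triples, and the function name `authors_of_all_versions` is hypothetical.

```python
# Version level triples from the example, as (subject, predicate, object) tuples.
# A real deployment would query an RDF store; tuples keep the sketch self-contained.
triples = [
    (":datasetV1", "dct:isVersionOf", ":dataset"),
    (":datasetV1", "dct:author", ":bill"),
    (":datasetV2", "dct:isVersionOf", ":dataset"),
    (":datasetV2", "dct:author", ":bill"),
    (":datasetV2", "dct:author", ":ben"),
    (":datasetV3", "dct:isVersionOf", ":dataset"),
    (":datasetV3", "dct:author", ":alice"),
]


def authors_of_all_versions(triples, summary):
    """Collect the distinct authors of every version that points at the summary."""
    versions = {s for s, p, o in triples if p == "dct:isVersionOf" and o == summary}
    return sorted({o for s, p, o in triples if p == "dct:author" and s in versions})


print(authors_of_all_versions(triples, ":dataset"))
```

Computed this way, :alice shows up in the aggregate as soon as version 3 is published, without anyone editing the summary level description.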

Alasdair


Alasdair J G Gray, Lecturer in Computer Science, Heriot-Watt University, UK. Email: A.J.G.Gray@hw.ac.uk Web: http://www.alasdairjggray.co.uk ORCID: http://orcid.org/0000-0002-5711-4872 Telephone: +44 131 451 3429 Twitter: @gray_alasdair



AlasdairGray commented 10 years ago

Additionally, what should be the interpretation of the following:

:distribution1 a dcat:Distribution;
  dct:author :claire.

Are :claire and :bill both authors of version 1?
This would be the consequence of the suggestion for the summary level being a superset of version level.
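To make the consequence concrete, here is a minimal sketch of the "superset" reading applied one level down. It assumes :distribution1 is attached to :datasetV1 via dcat:distribution (the thread does not state this link), holds the triples as plain Python tuples, and uses a hypothetical function name, `rolled_up_authors`.

```python
# (subject, predicate, object) tuples. The dcat:distribution link between
# :datasetV1 and :distribution1 is an assumption made for this sketch.
triples = [
    (":datasetV1", "dct:author", ":bill"),
    (":datasetV1", "dcat:distribution", ":distribution1"),
    (":distribution1", "dct:author", ":claire"),
]


def rolled_up_authors(triples, version):
    """Authors of a version if authorship is inherited upward from its distributions."""
    own = {o for s, p, o in triples if s == version and p == "dct:author"}
    dists = {o for s, p, o in triples if s == version and p == "dcat:distribution"}
    inherited = {o for s, p, o in triples if s in dists and p == "dct:author"}
    return sorted(own | inherited)


print(rolled_up_authors(triples, ":datasetV1"))
```

Under that reading both :bill and :claire come back as authors of version 1, which is exactly the interpretation being questioned.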

micheldumontier commented 10 years ago

re: contributors. Some applications will care about who contributed to which version. One can aggregate across all versions to get the list of all creators and contributors.

CiTO (http://purl.org/net/cito/) has a rich set of relationships; we use citesAsAuthority, which is defined as "The citing entity cites the cited entity as one that provides an authoritative description or definition of the subject under discussion." CiTO would be the right place to find other desired relationships.

re: publisher. We acknowledge that it may be difficult to determine precisely who is responsible for the publication of a dataset. Here, we use the Dublin Core metadata term "publisher", and intend to use its definition. We will update the note to reflect that ( @AlasdairGray ).