FAIR-Data-EG / consultation

A call for contributions to the report of the FAIR Data Expert Group

What is the first step in "F" ? #10

Open brucellino opened 7 years ago

brucellino commented 7 years ago

I would like to take the case of a new data repository, in an environment which has a nebulous level of hierarchy, then ask the question : "How does this data repository get found ?"

One may consider many African countries and institutes, which may have data repositories, but which are "invisible".

How is data currently discovered ?

Data repositories are all over the place. Many of them serve such a restricted purpose that it makes no sense to talk of "finding" them, because everyone who needs to know about them already does. However, what about the ones which could be FAIR (i.e., their technology and licensing allow the AIR part), but are not F because nobody knows about them ?

Typically the way things are found is via "search". A single-point of entry to finding data is really useful - and it's reasonable to assume that this will depend on the community in practice. Astronomers have ADS, biologists have a whole mess of databases, climate scientists have ESGF, Earth Observers have [GEO](http://www.geoportal.org/), etc. (ok, these are comparing apples and oranges in many cases).

Sometimes there's a federation based on a metadata standard like DDI...

But for the data to be returned in a search, the repositories still need to be indexed, monitored etc.

In the best case scenario, perhaps there's an overarching body at the national or community level which proposes best practice in bringing new repositories in, how they are evaluated and supported, etc.

In the case of African countries, there is rarely any such oversight, and much data either gets lost over time or simply is never seen by anyone. Essentially, we can never check the AIR bits, because we never F the data !

What is being done to support the inclusion of new repositories ?

  1. What technology guidelines are there for creating a FAIR-friendly repository ?
  2. What policy guidelines are there for "registering" the repository with an indexing service or authority ?
  3. For monitoring purposes, how does one communicate with the repo maintainer on the state and compliance of their repository ?

These are clearly complex and subtle questions with no single answer. I'd be happy to narrow down any aspects in discussion.

asmatspatial commented 7 years ago

Very good post, and I agree with its contents. I'd like to add that the first step in enabling Findability is to make an inventory of all the relevant datasets (some people call it a Data Catalog, too). The inventory should be maintained by a national body/organization. It would contain metadata about the underlying datasets.

markwilkinson commented 7 years ago

I tend to disagree with regard to how this should be accomplished. One of the visions for FAIR was that it would enable data reuse in a decentralized world. Similarly, the vision for the EOSC also mentions the desire to be decentralized. As such, I don't think the solution is to build a bunch of additional nationally-run, domain-specific silos - that might lead us to unmaintainable infrastructure that disappears after a few years. Optimally, Google and other search engines would support discovery (increasingly true via richer Schema.org and Google's support for dataset descriptors [https://developers.google.com/search/docs/data-types/datasets]). Specialized (defined however the user wishes) indexes may arise from user demand, or for other reasons, but I don't see them as being a requirement for FAIRness, nor as necessarily accomplished by (dependent on) a national infrastructure.
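To make the search-engine route concrete, here is a minimal sketch of the kind of Schema.org `Dataset` descriptor a repository landing page could embed as JSON-LD. The helper names (`dataset_jsonld`, `as_script_tag`) and every value in the example record are hypothetical placeholders, and the property selection is an assumption; the Schema.org and Google documentation remain the authority on which properties are expected.

```python
import json


def dataset_jsonld(name, description, url, license_url, identifier=None, keywords=()):
    """Build a Schema.org Dataset description as a JSON-LD dictionary."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }
    # Optional properties are only emitted when a value was supplied.
    if identifier:
        doc["identifier"] = identifier
    if keywords:
        doc["keywords"] = list(keywords)
    return doc


def as_script_tag(doc):
    """Wrap the JSON-LD in the <script> element a landing page would embed."""
    body = json.dumps(doc, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical record: all names, URLs and identifiers are placeholders.
record = dataset_jsonld(
    name="Example rainfall observations",
    description="Daily rainfall measurements from a hypothetical station network.",
    url="https://repository.example.org/datasets/rainfall",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    identifier="https://doi.org/10.1234/example",
    keywords=["rainfall", "climate"],
)
print(as_script_tag(record))
```

The point of the sketch is that the descriptor lives on the repository's own pages, so no central registry has to be told about the repository before a crawler can pick it up.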

ghost commented 7 years ago

In our research we've determined that the following steps can be taken to reach a FAIR basis for repositories:

  1. Assign a persistent identifier such as DOI, Handle, URN
  2. Choose a license standard such as Creative Commons
  3. Apply a metadata standard such as Dublin Core
  4. Use http(s) as the communication protocol standard
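As a rough illustration, the four steps above could be turned into a simple self-check that a repository runs over its own records. This is only a sketch: the record layout, the field names (`identifier`, `license`, `access_url`), the chosen subset of Dublin Core elements and the identifier patterns are all assumptions made for the example, not part of any FAIR specification.

```python
import re
from urllib.parse import urlparse

# Rough patterns for the identifier schemes named above (DOI, Handle, URN).
# Real validators are stricter; these only catch obviously local identifiers.
PID_PATTERNS = (
    re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)10\.\d{4,9}/\S+$", re.IGNORECASE),
    re.compile(r"^(https?://hdl\.handle\.net/|hdl:)[^/\s]+/\S+$", re.IGNORECASE),
    re.compile(r"^urn:[a-z0-9][a-z0-9-]{0,31}:\S+$", re.IGNORECASE),
)

# Assumed minimal subset of Dublin Core elements for this sketch.
REQUIRED_DC_ELEMENTS = ("title", "creator", "date")


def fair_baseline(record):
    """Report, per step, whether a metadata record meets the four-step baseline."""
    identifier = record.get("identifier") or ""
    access_url = record.get("access_url") or ""
    return {
        "persistent_identifier": any(p.match(identifier) for p in PID_PATTERNS),
        "license": bool(record.get("license")),
        "metadata": all(record.get(e) for e in REQUIRED_DC_ELEMENTS),
        "http_access": urlparse(access_url).scheme in ("http", "https"),
    }


# Hypothetical record; every value is a placeholder.
example = {
    "identifier": "https://doi.org/10.1234/example",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "title": "Example rainfall observations",
    "creator": "Example Institute",
    "date": "2017-07-01",
    "access_url": "https://repository.example.org/datasets/rainfall",
}
report = fair_baseline(example)
print([step for step, ok in report.items() if not ok])
```

A check like this says nothing about whether the metadata is any good; it only flags records that miss one of the four baseline steps.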

The next step would be focusing and advancing towards community relevant standards.

CaroleGoble commented 7 years ago

Bioschemas.org is taking steps to address the F in FAIR through lightweight extensions to a web "de facto" standard - schema.org. This takes point 3 of jkb4TU's points and extends it. Moreover it argues for

Bioschemas.org will be running a workshop 2-5 October at EBI Hinxton, Cambridge. There we will in the first 2.5 days present adoption examples from the Life Sciences, and the last 2.5 days is an open meeting for EOSC, examining a number of options for addressing lightweight mechanisms for F.

antbro commented 7 years ago

I agree that 'Findability' (or Discovery, as often called) is a critical bridge between producing and sharing. And also agree that decentralised is the optimum, wherein BioSchemas can help. But while catalogs are popping up left, right and center, and BioSchemas are gaining speed - all of this is about discovery based on general characteristics of the data, not the data themselves. This means we can find places where things exist that 'might' be suitable in terms of content for one's intended purpose, and might be allowed to be used for the intended purpose. But equally they might not be! So the real impact of Findability/Discoverability will come when we can query the data directly - in ways that still protect the data. This is doable, and we and others have been doing it for some time now (various approaches, it's almost a science in itself!). Thereby people can find EXACTLY what they want/need, know they'll be allowed to access it under acceptable terms, and can even undertake feasibility studies there and then (e.g., determining number of cases in a dataset with certain characteristics, to allow power estimates). GA4GH has not yet set about defining a general model/API for enabling resources to federate on this level, and FAIR plus the EOSC would be a perfect environment to take all this forward. Happy to discuss further, Anthony Brookes
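One simple way to picture data-level discovery that still protects the data is a count-only query with small-cell suppression. The sketch below is an illustration under stated assumptions, not the approach used by the commenter's group or any GA4GH API: the function names, the synthetic rows and the threshold of 5 are all hypothetical.

```python
# Assumed disclosure threshold: counts below this are never reported exactly.
MIN_CELL_SIZE = 5


def count_matching(rows, criteria):
    """Count rows whose fields equal every value in `criteria`."""
    return sum(all(row.get(k) == v for k, v in criteria.items()) for row in rows)


def discovery_query(rows, criteria, min_cell_size=MIN_CELL_SIZE):
    """Answer 'how many cases match?' without ever returning the rows themselves."""
    n = count_matching(rows, criteria)
    if n < min_cell_size:
        # Suppress small counts so rare individuals cannot be singled out.
        return {"count": None, "note": f"fewer than {min_cell_size} matching cases"}
    return {"count": n, "note": "ok"}


# Synthetic rows standing in for a protected dataset held by the resource.
rows = [{"diagnosis": "T2D", "sex": "F"}] * 6 + [{"diagnosis": "T1D", "sex": "M"}] * 2

print(discovery_query(rows, {"diagnosis": "T2D"}))
print(discovery_query(rows, {"diagnosis": "T1D"}))
```

A requester learns whether a resource holds enough suitable cases for a feasibility estimate, while the record-level data never leave the resource. Real deployments need considerably more than a threshold (for example, protection against repeated differencing queries).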

CaroleGoble commented 6 years ago

I completely agree with Mark. The WWW way is self-management, publish, partial but managed anarchy that is robust. We must AVOID centralised and highly constrained approaches.

The EU is OBSESSED by big portals and managed catalogues. Loosen up! Why not searchable by a search engine? Open means open. Pull as well as push. Let folks publish and harvest (that's the Bioschemas way...)
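The "harvest" half of publish-and-harvest can be sketched with nothing but the Python standard library: anyone can pull embedded JSON-LD out of a published landing page and keep the `Dataset` descriptions. The class and function names and the sample page are hypothetical, and the sketch assumes each JSON-LD block holds a single top-level object; it is not the Bioschemas tooling.

```python
import json
from html.parser import HTMLParser


class JsonLdHarvester(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than abort the harvest


def harvest_datasets(html):
    """Return the Schema.org Dataset descriptions embedded in an HTML page."""
    parser = JsonLdHarvester()
    parser.feed(html)
    parser.close()
    return [
        d for d in parser.documents
        if isinstance(d, dict) and d.get("@type") == "Dataset"
    ]


# Hypothetical landing page; the dataset name is a placeholder.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example rainfall observations"}
</script>
<script type="text/javascript">var x = 1;</script>
</head><body>Landing page</body></html>"""

found = harvest_datasets(PAGE)
print([d["name"] for d in found])
```

Because the harvester needs only the public page, any community can build its own index on top of what repositories already publish, without a central portal granting access.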

The challenge will be distilling the “in common” without enforcing one view or need.

antbro commented 6 years ago

Well said! I fully agree Cheers Tony



peter-wittenburg commented 6 years ago

Brucellino raises a very good point which has now been mentioned in meetings several times - most recently in South Africa, by researchers from most South-Sahel African countries. It is also being discussed in RDA groups, since referring to Google (or similar) search engines does not help - in general those with the biggest mouth will win in the Google search engine, and the others disappear into nowhere unless one is able to specify the search question in a very detailed way. But this is the point. In RDA the following levels of discussion were taken (and I guess that no one claims to have THE optimal solution).

Don't know whether this helps, but thanks again to Brucellino for raising this point; in the report we need to address this topic somehow.

Daniel-Mietchen commented 6 years ago

Apparently, there is an RDA Interest Group working on discovery, of which I have just seen this snippet of advice: https://twitter.com/TheDataMonsters/status/910538903844720646 . I asked for more details.

band commented 6 years ago

There is nothing new in that advice.

The information, suggestions, and examples in this issue comment stream are better and provide ideas about creating measurable properties.

antbro commented 6 years ago

Thanks Daniel!! Strange 10 point list. Seems more about how to find, not how to make findable? Making data/samples/subjects accurately findable (low false positive and false negative rates), based on data level rather than metadata level info, in a safe way that protects confidentiality, and enables those discoveries to be based on data linkage across a federated lattice (across multiple types of research and healthcare data) is a science in itself. All initiatives interested in this need to join forces. Cheers Tony

Professor Anthony J Brookes Department of Genetics University of Leicester University Road Leicester, LE1 7RH, UK Tel: +44 (0)116 2523401 Mbl: +44 (0)777 0620396



antbro commented 6 years ago

Thanks Daniel!! Strange 10 point list. Seems more about how to find data, not how to make findable? IMO...making data/samples/subjects accurately findable (low false positive and false negative rates), based on data level rather than metadata level info, in a safe way that protects confidentiality, and enables those discoveries to be based on data linkage across a federated lattice (across multiple types of research and healthcare data) is a science in itself. All initiatives interested in this need to join forces. Cheers Tony
