gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

API for exploring local ipt resources #1249

Closed dvdscripter closed 7 years ago

dvdscripter commented 8 years ago

The IPT is amazing publishing software for biodiversity data, but sometimes we want to offer potential users of this data a nice interface, or run analyses on it.
The current solution is to use lontra-harvester to index your resources and build your own API on top of your database stack.
Small institutions can't manage such a complex stack, and may only be able to afford a portal that shows relevant information about local IPT resources.
I think that if the IPT provided an API through which users/devs could access this information without needing to install indexers, it would really boost integration with institutional portals and similar tools.

tigreped commented 8 years ago

I agree. An API to query IPT data would be an awesome asset that might even completely remove the need for the indexing/harvesting tools.

That would probably also mean replacing the text file structures that the IPT depends on to persist information with databases and/or indexing technologies (Solr, Elasticsearch, etc.).

Furthermore, I believe the IPT could also implement some sort of data replication strategy that allows different IPT instances to replicate resources, somewhat similar to Metacat's replication function.

dshorthouse commented 8 years ago

That's an awful lot of logic to put in the IPT, especially since the many installs in the wild are at varying states and versions. I vote instead to make it better known that the UUID of a registered dataset can be used as the datasetKey parameter in GBIF's RESTful API, e.g. http://api.gbif.org/v1/occurrence/search?datasetKey=275319e1-f91c-406f-b239-62cb9d4185cb, and to concentrate efforts instead on the performance and capabilities of a single API endpoint. If it's the discovery of machine access that's the bottleneck, then perhaps another tab called "API" could be added to e.g. http://www.gbif.org/dataset/42319b8f-9b9d-448d-969f-656792a69176, drawing in the documentation of the existing GBIF API with, e.g., the datasetKey pre-populated either as a query param or as part of the URI path.
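For illustration, a minimal sketch in Python (using the `requests` library) of the pattern described above, with the same dataset UUID as in the example link:

```python
# Query GBIF's occurrence search, scoped to one registered dataset via
# its UUID as the datasetKey parameter (as in the example link above).
import requests

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"
dataset_key = "275319e1-f91c-406f-b239-62cb9d4185cb"

resp = requests.get(GBIF_SEARCH, params={"datasetKey": dataset_key, "limit": 20})
resp.raise_for_status()
payload = resp.json()

# The response is paged; 'count' is the total number of matching records.
print(f"{payload['count']} occurrences in dataset {dataset_key}")
for occ in payload["results"]:
    print(occ.get("scientificName"), occ.get("country"))
```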

Furthermore, I'm not sure such an approach would remove the need for harvesting routines. It might simplify the code for such actions by using well-established HTTP and JSON, but it would be substantially slower than working with a native DwC-A. And what about the extensions within a DwC-A?

I do, however, agree with a replication function, and this deserves careful consideration. Politically, it's a wise move (LOCKSS). But there are bits within this that make it tricky, e.g. DOI registration agency particulars and public accessibility, parent/child synchronization, opt-in/opt-out, organizational framework (within Nodes or across Nodes?), etc.

tigreped commented 8 years ago

@dshorthouse et al., what we are proposing is indeed a big shift in the IPT's purpose, scope, and paradigm. We hope you guys allow the concept to sink in for a while. After all, it is just a proposal. =)

The current scenario centralizes everything on GBIF infrastructure. The first problem is that, in order to consume the data available (published) in a given IPT instance, you have to wait until GBIF harvests (and processes) this data. Compare that with immediately having this data available via the IPT's own API layer as soon as the resource is published.

This limits access via the GBIF API to the resources that are already published on GBIF. Sometimes that might not be the case: there could be scenarios in some countries where several resources won't necessarily be published to GBIF but still need to be accessible directly from their IPTs. In these cases, this API would come in handy.

There is more. If some of the data-providing work were distributed from the GBIF API to the IPT's own API, we could shift GBIF's "obligation" to implement more user requirements in its API, by allowing users themselves to propose and implement them, in a more agile manner, in the IPT API project.

GBIF's API is probably also closely tied to the way GBIF interprets and organizes data internally, due to its own needs and strategies. Requesting changes to this API feels like asking to change very internal processes. Maybe @timrobertson100, @cgendreau and others could give more insight on that.

I think that if the IPT had an API of its own, it would be much easier to have more non-GBIF developers working on it and submitting patches. That is more of a personal impression of mine, of course: I think that users/institutions should become more and more empowered to get involved in managing and improving the IPT stack themselves, turning it into more of an actual community tool. I don't see much opening or purpose in having general users contribute to the more "internal" GBIF projects.

Let me give one example of a relevant problem we have had to deal with over the last year: integrating and providing, in our own data portal, repatriated data we fetch from GBIF.

Currently, what we have to do is: go to the GBIF occurrence search page, fetch a list of all countries that publish data on the Brazilian territory (Country = Brazil), and apply the PUBLISHING COUNTRY filter for each one of those countries from the list (excluding Brazil as a publisher, of course), one by one, requesting downloads that are sent by e-mail as zip files. Afterwards, we have to republish those files in a repatriated-data IPT of our own, in test mode, resetting the metadata.

Finally, from those resources, we harvest the data into our data portal. I'm sure you can see the trouble in such a clumsy process; even the first step alone has to be done by hand, though it could be sketched programmatically, as below. There is no current implementation in the GBIF API that provides this in an easier way.
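As a sketch of that first step only, the GBIF occurrence search's facet support can at least list the publishing countries programmatically (the parameter names are the documented ones for occurrence search; this does not cover the download-and-republish steps):

```python
# List which countries publish occurrence records recorded in Brazil,
# using occurrence search facets; limit=0 skips the record payload itself.
import requests

resp = requests.get("https://api.gbif.org/v1/occurrence/search",
                    params={"country": "BR", "facet": "publishingCountry",
                            "limit": 0, "facetLimit": 300})
resp.raise_for_status()

for facet in resp.json()["facets"]:
    for entry in facet["counts"]:
        if entry["name"] != "BR":  # exclude Brazil itself as a publisher
            print(entry["name"], entry["count"])
```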

All this limits us to providing only third-party information (GBIF's), instead of first-hand data straight from the source. It is difficult to fetch information about each IPT installation/resource that is providing data within the aggregated DwC-A assembled by GBIF. There is currently no trivial repatriated-data process, integrated with the IPT or otherwise.

GBIF's data portal is an excellent aggregator, but I don't see why it has to concentrate so many features when it could distribute them better among GBIF's other tools, or at least provide alternatives.

Of course we hope GBIF as a facility will have a long life, but I believe open software has taught us big lessons. If we stick to open source tools and engage the community around them, things might last longer. In a scenario where GBIF suddenly vanished and had to discontinue services and shut down production environments, the IPT could still be kept alive, self-maintained by its users and institutions.

The replication idea is something else; it can be better understood here, and should be carried over to another issue for further discussion, so I'll let it be for now.

We hope only to contribute these thoughts and ideas on the possible future steps of the IPT. Best regards to all!

dvdscripter commented 8 years ago

As @tigreped said with the repatriated data example, GBIF has all the capabilities now, but we're a bit slow at delivering solutions.
Another example is an institution's demand to know how many records they have. If they choose not to register their data on GBIF, they need to do some parsing of the IPT's list-of-resources pages.
I'm doing a similar "hack" to monitor all resources from all IPTs in the SiBBr partnership, as sketched below. I hope the IPT keeps its web design or my tool will break :no_mouth:
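A minimal sketch of what such a hack looks like, assuming a hypothetical IPT base URL and guessing at the HTML structure of the public resource table (which is exactly why it breaks when the web design changes):

```python
# Screen-scrape an IPT's public resource list page, since no API exists.
# The base URL is hypothetical and the table structure is an assumption;
# any redesign of the page silently breaks this kind of tool.
import requests
from bs4 import BeautifulSoup

IPT_BASE = "https://ipt.example.org"  # hypothetical IPT installation

html = requests.get(IPT_BASE).text
soup = BeautifulSoup(html, "html.parser")

# Assume each published resource appears as a row in an HTML table.
for row in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)  # e.g. resource title, type, record count, ...
```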
I understand the time it will take to make these changes, but an elegant solution (an API) is needed. I think you should consider it, not for now, but as a long-term goal. No publishing tool available has an API; no website besides GBIF has semi-automatic statistics available.
I'm glad to help if needed!

kbraak commented 7 years ago

Thanks to everyone who has contributed to this issue. An update on each of the subjects addressed in this issue can be found below.

Repatriation

I wanted to update everyone that GBIF added a new repatriation filter to its occurrence search in June 2016. This filter allows searching for occurrence records whose publishing country is different from the country where the record was recorded. For example, the following link shows the repatriated data of Costa Rica: http://api.gbif.org/v1/occurrence/search?country=CR&repatriated=true. Note this filter is documented in the GBIF API.
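For example, a minimal sketch of the same query in Python (the `publishingCountry` field shown per record is part of the documented occurrence search response):

```python
# Fetch occurrences recorded in Costa Rica but published by another
# country, using the repatriated filter on the occurrence search.
import requests

resp = requests.get("https://api.gbif.org/v1/occurrence/search",
                    params={"country": "CR", "repatriated": "true", "limit": 5})
resp.raise_for_status()
data = resp.json()

print(f"{data['count']} repatriated records for Costa Rica")
for occ in data["results"]:
    print(occ.get("country"), "<-", occ.get("publishingCountry"))
```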

To keep up to date with the latest features added to the GBIF API, please subscribe to the GBIF API Users mailing list. For everyone's convenience, you can read about the two latest rounds of improvements made to the GBIF API in these posts:

Country/Node overview

By leveraging the GBIF API (e.g. by ensuring all datasets within a country are registered with GBIF), the GBIF Node API allows a list of the country's organizations, installations, and datasets to be easily retrieved:
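A minimal sketch of such a country overview, using the documented registry endpoints (organization listing filtered by country, dataset search filtered by publishing country; the country code is just an example):

```python
# Retrieve a country overview from the GBIF registry API: the publishing
# organizations in a country and the datasets published from it.
import requests

API = "https://api.gbif.org/v1"
country = "CR"  # ISO 3166-1 alpha-2 country code, as an example

orgs = requests.get(f"{API}/organization", params={"country": country}).json()
print(f"{orgs['count']} publishing organizations in {country}")

datasets = requests.get(f"{API}/dataset/search",
                        params={"publishingCountry": country}).json()
print(f"{datasets['count']} datasets published from {country}")
```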

Replication

IPT users manage their data locally, and by registering their dataset with GBIF they make it "available to the greater scientific community via a centralized search". At the same time, the IPT can be part of a broader network (e.g. OBIS), while the administrator retains control over the repository and how it's managed.

To further enhance datasets' accessibility and to guarantee their long-term archival, GBIF is investigating replication of datasets from the IPT into DataONE.

Registered Dataset Inventory

I wanted to highlight a feature of the IPT that isn't very well known: the registered dataset inventory, a simple JSON inventory of all registered resources. This feature can be used to monitor whether a dataset has been properly indexed by GBIF, by comparing its target and indexed record counts. More information about this feature can be found in the User Manual here.
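As a minimal sketch (the /inventory/dataset path is the one described in the User Manual; the base URL is hypothetical, and the JSON field names may vary between IPT versions, so they are treated loosely here):

```python
# Read an IPT's registered dataset inventory as JSON.
import requests

IPT_BASE = "https://ipt.example.org"  # hypothetical IPT installation

resp = requests.get(f"{IPT_BASE}/inventory/dataset")
resp.raise_for_status()
inventory = resp.json()

# Each entry describes one registered resource, including its target
# record count, which can be compared against GBIF's indexed count.
for resource in inventory.get("registeredResources", []):
    print(resource)
```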

abubelinha commented 4 years ago

@kbraak contributed a lot of info on how to retrieve information from the GBIF API, which is fine. But this issue was not about the GBIF API: it is about an IPT API, which is a new, different, independent, and great idea from @dvdscripter.

Also, there are people like @diogok working on it. So I don't understand why this issue has been closed.