University of Texas Rio Grande Valley (PAUH)

sarayourfriend commented 1 year ago

Provider API Endpoint / Documentation

https://portal.torcherbaria.org/portal/collections/misc/collprofiles.php?collid=456

Provider description

This is a Darwin Core formatted data dump of CC BY-NC scientific grade images of plant observations.

Licenses Provided

CC BY-NC

Provider API Technical info

We would implement this much in the same way as iNaturalist was implemented. See https://github.com/WordPress/openverse/issues/1608 for the history and discussion of that project.

We would need to reference (scrape) this page to know when a new dataset was available for processing: https://portal.torcherbaria.org/portal/collections/datasets/datapublisher.php

The dataset link is stable and the upstream provider images for the dataset are torcherbaria links.
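
Because the dataset link itself does not change, spotting a new dump probably means watching the text around that link (for example a publication date). A rough sketch of that idea follows. It assumes the publisher page lists each archive in an HTML table row and that the row text changes when a new dump is published; neither assumption has been checked against the real page, and `marker` is a hypothetical substring used to pick out this collection's archive link.

```python
from html.parser import HTMLParser
from typing import Optional


class RowCollector(HTMLParser):
    """Collect (links, text) for every <tr> found in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []       # list of (list of hrefs, joined row text)
        self._links = None   # None means "not currently inside a row"
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._links, self._text = [], []
        elif tag == "a" and self._links is not None:
            href = dict(attrs).get("href")
            if href:
                self._links.append(href)

    def handle_data(self, data):
        if self._links is not None and data.strip():
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "tr" and self._links is not None:
            self.rows.append((self._links, " ".join(self._text)))
            self._links = None


def dataset_fingerprint(html: str, marker: str) -> Optional[str]:
    """Return the text of the first row with a link containing `marker`."""
    collector = RowCollector()
    collector.feed(html)
    for links, text in collector.rows:
        if any(marker in link for link in links):
            return text
    return None


def has_new_dataset(html: str, marker: str, last_seen: Optional[str]) -> bool:
    """True when the row exists and its text differs from the stored value."""
    current = dataset_fingerprint(html, marker)
    return current is not None and current != last_seen
```

The caller would fetch the page, store the fingerprint after each successful ingestion, and pass it back as `last_seen` on the next run.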

Checklist to complete before beginning development

[X] Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
[X] Verify the API provides license info (license type and version; license URL provides both, and is preferred)
[X] Verify the API provides stable direct links to individual works.
[X] Verify the API provides a stable landing page URL to individual works.
[X] Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other metadata, tags, etc.
[ ] Attach example responses to API queries that have the relevant info.
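
To illustrate the license item above: a Creative Commons license URL carries both the license type and the version in its path. The sketch below pulls them out with a regular expression. `parse_cc_license_url` is a hypothetical helper written for this illustration, not Openverse's own license utility, and it deliberately ignores public domain URLs such as CC0, which use a different path.

```python
import re
from typing import Optional, Tuple

# Matches e.g. https://creativecommons.org/licenses/by-nc/4.0/
_CC_LICENSE = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(?P<type>[a-z-]+)/(?P<version>\d+\.\d+)(?:/|$)",
    re.IGNORECASE,
)


def parse_cc_license_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (license type, version), or None if the URL does not match."""
    match = _CC_LICENSE.match(url.strip())
    if match is None:
        return None
    return match.group("type").lower(), match.group("version")
```

For this provider's CC BY-NC images, a URL like `https://creativecommons.org/licenses/by-nc/4.0/` would yield the type `by-nc` and the version `4.0`.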

Implementation

[ ] 🙋 I would be interested in implementing this feature.

ngken0995 commented 1 year ago

@sarayourfriend can I work on this?

obulat commented 1 year ago

Most commonly, Openverse parses the provider API responses to get the media data. There is also a process for parsing SQL dumps from iNaturalist. This provider will require creating a special process for parsing CSV data, which will be a large project. Would you like to build the CSV parsing process and add this provider, or would you prefer to add a provider that has an API (so that adding it to Openverse would follow a well-established pattern)?

@stacimc, @rwidom, this provider has data in CSV. What would the best way of parsing CSV be? Pandas or polars, or something else? Would we need to create a different DataIngester class for such providers?
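
For a sense of scale, here is a minimal sketch that streams such a file with only the standard library `csv` module, so memory use stays flat regardless of dump size. The column names (`identifier`, `references`, `license`, `creator`, `title`) are assumed to follow the GBIF Simple Multimedia extension and have not been checked against this provider's dump; the output keys are hypothetical and are not the fields of any existing Openverse ingester class.

```python
import csv
from typing import Iterator, TextIO


def _clean(row: dict, column: str) -> str:
    """Return the stripped cell value, or an empty string if it is missing."""
    return (row.get(column) or "").strip()


def iter_image_rows(csv_file: TextIO) -> Iterator[dict]:
    """Yield one record per usable multimedia row, reading row by row."""
    for row in csv.DictReader(csv_file):
        image_url = _clean(row, "identifier")
        license_url = _clean(row, "license")
        # Skip rows that lack an image link or license information.
        if not image_url or not license_url:
            continue
        yield {
            "image_url": image_url,
            "landing_url": _clean(row, "references") or None,
            "license_url": license_url,
            "creator": _clean(row, "creator") or None,
            "title": _clean(row, "title") or None,
        }
```

Whether pandas or polars would be a better fit likely depends on how much per-column cleaning the real data needs; the generator above only shows that a row-at-a-time pass is possible without extra dependencies.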

ngken0995 commented 1 year ago

@obulat I'm still new to the Openverse codebase. I should start with a provider that already has an API before moving on to the special process of parsing CSV data. There are three open providers (Digital Commonwealth, DigitalNZ, and National Library of Australia). Which one should I work on?

obulat commented 1 year ago

You can take any issue with 'provider:...' label, e.g. #1771.

Use the just command to create the script from a template: https://github.com/WordPress/openverse/blob/dde77d1869a17aed6f9fdd33385fdd7e02146366/catalog/justfile#L145

ngken0995 commented 1 year ago

@obulat Okay. Thank you for the information. I would like to work on #1771.

sarayourfriend commented 1 year ago

I think for this provider we should develop a standard Darwin Core approach, hopefully something flexible enough that it can also take in Dublin Core (which Darwin Core is based on). Theoretically one generic provider could be created that can handle all these related providers: https://github.com/WordPress/openverse/issues?q=is%3Aissue+is%3Aopen+Darwin+Core+label%3A%22%E2%98%81%EF%B8%8F+provider%3A+images%22+

Leveraging those standard metadata formats would go a long way in making it easier to ingest GLAM and scientific observation data in the future.
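
One way such a generic provider could stay flexible is an alias table that maps each internal field to the column headers different vocabularies might use for it. The sketch below is only an illustration of that idea: the internal field names are hypothetical, and the header spellings (with or without a `dc:`, `dcterms:`, or `ac:` prefix) are assumptions about how real exports label their columns.

```python
from typing import Dict, Mapping, Optional, Sequence

# Internal field -> candidate column headers, in order of preference.
FIELD_ALIASES: Dict[str, Sequence[str]] = {
    "image_url": ("ac:accessURI", "accessURI", "dc:identifier", "identifier"),
    "landing_url": ("dcterms:references", "references"),
    "license_url": ("dcterms:license", "license"),
    "creator": ("dc:creator", "creator"),
    "title": ("dc:title", "title"),
}


def normalize_row(
    row: Mapping[str, Optional[str]],
    aliases: Mapping[str, Sequence[str]] = FIELD_ALIASES,
) -> Dict[str, Optional[str]]:
    """Map a raw row onto internal fields using the first non-empty alias."""
    # Compare headers case-insensitively and ignore surrounding whitespace.
    lowered = {key.strip().lower(): value for key, value in row.items() if key}
    result: Dict[str, Optional[str]] = {}
    for field, candidates in aliases.items():
        result[field] = None
        for name in candidates:
            value = (lowered.get(name.lower()) or "").strip()
            if value:
                result[field] = value
                break
    return result
```

Supporting another related provider would then mostly mean extending `FIELD_ALIASES` rather than writing a new parser.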
