Open sarayourfriend opened 1 year ago
@sarayourfriend can I work on this?
Most commonly, Openverse parses the provider API responses to get the media data. There is also a process for parsing SQL dumps from iNaturalist. This provider will require creating a special process for parsing CSV data, which will be a large project. Would you like to create the CSV parsing process and add this provider, or would you prefer to add a provider that has an API (so that adding it to Openverse follows a well-established pattern)?
@stacimc, @rwidom, this provider has data in CSV. What would be the best way to parse CSV? Pandas or polars, or something else? Would we need to create a different DataIngester class for such providers?
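One dependency-free option for the question above is the standard library's `csv` module, which streams rows instead of loading the whole dump into memory. This is a minimal sketch, assuming the dump is a CSV file with a header row; the function name `iter_csv_batches` and the batch size are hypothetical and not part of Openverse.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO


def iter_csv_batches(handle: TextIO, batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of row dicts so a large dump is never fully in memory."""
    reader = csv.DictReader(handle)  # uses the first line as column names
    batch: List[Dict[str, str]] = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Usage with an in-memory stand-in for the real file (hypothetical columns).
sample = io.StringIO("identifier,title\n1,a\n2,b\n3,c\n")
batches = list(iter_csv_batches(sample, batch_size=2))
```

Pandas or polars would offer faster columnar operations, but a streaming reader like this keeps memory use flat and adds no dependency, which may matter for a very large dump.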
@obulat I'm still new to the Openverse codebase. I should work on a provider that already has an API before moving on to the special process of parsing CSV data. There are three open providers (Digital Commonwealth, DigitalNZ, and National Library of Australia). Which one should I work on?
You can take any issue with a 'provider:...' label, e.g. #1771.
Use the just command to create the script from a template: https://github.com/WordPress/openverse/blob/dde77d1869a17aed6f9fdd33385fdd7e02146366/catalog/justfile#L145
@obulat Okay. Thank you for the information. I would like to work on #1771.
I think for this provider we should develop a standard Darwin Core approach, hopefully something flexible enough that it can also take in Dublin Core (which Darwin Core is based on). Theoretically one generic provider could be created that can handle all these related providers: https://github.com/WordPress/openverse/issues?q=is%3Aissue+is%3Aopen+Darwin+Core+label%3A%22%E2%98%81%EF%B8%8F+provider%3A+images%22+
Leveraging those standard metadata formats would go a long way in making it easier to ingest GLAM and scientific observation data in the future.
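To make the generic-provider idea more concrete, here is a minimal sketch of a term mapping. It assumes the dump's multimedia rows use Dublin Core style column names such as `identifier`, `references`, `title`, `creator`, and `license`; that should be verified against the actual files. The output field names and the function `map_record` are hypothetical and are not Openverse's actual schema.

```python
from typing import Dict, Optional

# Hypothetical mapping from Dublin Core / Darwin Core column names to
# illustrative output fields; check the real column names in the dump.
TERM_MAP = {
    "identifier": "url",
    "references": "foreign_landing_url",
    "title": "title",
    "creator": "creator",
    "license": "license_url",
}


def map_record(row: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Translate one CSV row; return None when required fields are missing."""
    record = {}
    for term, field in TERM_MAP.items():
        value = (row.get(term) or "").strip()
        if value:  # skip empty or absent columns
            record[field] = value
    # Treat the media URL and the license as required in this sketch.
    if "url" not in record or "license_url" not in record:
        return None
    return record
```

Keeping the mapping in a plain dictionary means a second Darwin Core or Dublin Core source could reuse `map_record` and supply only its own `TERM_MAP`.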
Provider API Endpoint / Documentation
https://portal.torcherbaria.org/portal/collections/misc/collprofiles.php?collid=456
Provider description
This is a Darwin Core formatted data dump of CC BY-NC scientific grade images of plant observations.
Licenses Provided
CC BY-NC
Provider API Technical info
We would implement this much in the same way as iNaturalist was implemented. See https://github.com/WordPress/openverse/issues/1608 for the history and discussion of that project.
We would need to reference (scrape) this page to know when a new dataset is available for processing: https://portal.torcherbaria.org/portal/collections/datasets/datapublisher.php
The dataset link is stable and the upstream provider images for the dataset are torcherbaria links.
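The check for a new dataset could be sketched as a comparison against a value recorded on the previous run. The layout of the publisher page is not described here, so this sketch does not parse it and only fingerprints the fetched page text. The function names and the stored-digest approach are assumptions; a real implementation would more likely extract a publication date from the page, since any unrelated change to the page would also alter the digest.

```python
import hashlib
from typing import Optional


def fingerprint(page_text: str) -> str:
    """Return a stable SHA-256 digest of the fetched page text."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def has_changed(page_text: str, last_digest: Optional[str]) -> bool:
    """True on the first run or when the page differs from the last run."""
    return last_digest is None or fingerprint(page_text) != last_digest


# Usage: the caller fetches the page and stores the digest between runs.
first = "<html>dataset published 2023-01-01</html>"  # made-up page content
stored = fingerprint(first)
```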
Checklist to complete before beginning development
Implementation