caltechlibrary / irdmtools

A Go and Python package for working with InvenioRDM repositories.
https://caltechlibrary.github.io/irdmtools

Harvesting challenges and constraints #1

Closed rsdoiel closed 1 year ago

rsdoiel commented 1 year ago
  1. Getting all record ids via OAI-PMH Identifier lists is slow due to rate limiting (slow measured in minutes)
  2. The time resolution of OAI-PMH Identifier lists is one day and I need at least minute-level resolution
  3. Harvesting more than 5,000 records is too slow (measured in hours); the estimated time to harvest 100,000 records is 24 hours

It is reasonable to harvest fewer than 500 records very quickly via the RDM REST API (records) if you can identify the keys easily (e.g. by using the request/query API).
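
For reference, a minimal Go sketch of that kind of small, query-driven harvest is below. It assumes the standard InvenioRDM `/api/records` endpoint and a bearer token; the base URL, query string, and page size are placeholders.

```go
// Sketch: fetch a small batch of records from the InvenioRDM REST API.
package irdmharvest

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// QueryRecords runs a search against /api/records and returns the decoded
// JSON result (hits, links, etc.) as a generic map.
func QueryRecords(baseURL, token, query string, size int) (map[string]interface{}, error) {
	u := fmt.Sprintf("%s/api/records?q=%s&size=%d", baseURL, url.QueryEscape(query), size)
	req, err := http.NewRequest("GET", u, nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var results map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&results); err != nil {
		return nil, err
	}
	return results, nil
}
```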

A mitigation strategy would be to replicate the RDM metadata in PostgreSQL and re-implement a read API (and any other end points we desire) against the replicated PostgreSQL content. You could have an rdmapid service, like the ep3apid service I implemented for EPrints.

RDM uses an ORM for managing object storage in PostgreSQL. The ORM is Python specific. A service like ep3apid would be better run using Go, but then either the replication process would need to use the ORM to write JSON objects to a column, or there needs to be a reliable way to assemble and validate the object in Go (e.g. either functions operating on a map or something akin to eprint3x.go in eprinttools).
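
As a rough sketch of the map-based approach (not how rdmutil actually does it), validation in Go could look something like the following; the required field names (id, metadata, title) are assumptions about the minimum structure we would rely on downstream.

```go
// Sketch: validate a harvested RDM record held as a generic map before
// writing it to a Postgres JSON column.
package irdmvalidate

import (
	"encoding/json"
	"fmt"
)

// ValidateRecord decodes src and checks for the minimal structure we rely on.
func ValidateRecord(src []byte) (map[string]interface{}, error) {
	var rec map[string]interface{}
	if err := json.Unmarshal(src, &rec); err != nil {
		return nil, fmt.Errorf("record is not valid JSON: %w", err)
	}
	if _, ok := rec["id"].(string); !ok {
		return nil, fmt.Errorf("record missing string id")
	}
	meta, ok := rec["metadata"].(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("record missing metadata object")
	}
	if _, ok := meta["title"].(string); !ok {
		return nil, fmt.Errorf("record metadata missing title")
	}
	return rec, nil
}
```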

rsdoiel commented 1 year ago

One option would be to make a dynamic throttle based on the total number of keys needed to finish the request, but RDM throttles by IP address, so that wouldn't solve the problem where rdmutil is being run by a process and an individual on the same machine.
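
For what it's worth, the dynamic throttle itself is trivial; the sketch below just spreads the remaining requests over a target window, which is why it doesn't help when another client on the same IP is also hitting the API.

```go
// Sketch: a dynamic throttle that spreads totalKeys requests over a target
// duration. It does nothing about other clients sharing the same IP.
package irdmthrottle

import "time"

// DelayPerRequest returns the pause to insert between requests so that
// totalKeys requests complete within target.
func DelayPerRequest(totalKeys int, target time.Duration) time.Duration {
	if totalKeys <= 0 {
		return 0
	}
	return target / time.Duration(totalKeys)
}
```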

rsdoiel commented 1 year ago

I am going to implement harvesting new ids by setting up PostgREST alongside RDM. That way I can do the queries I need to populate feeds efficiently and avoid the RDM API throttle on OAI-PMH.
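
A rough sketch of what that id harvest could look like from the Go side; the view name (updated_record_ids), its columns, and the service URL are assumptions for illustration, not the final schema.

```go
// Sketch: ask PostgREST for record ids updated since a given time. Assumes a
// view exposing "id" and "updated" columns in the PostgREST schema.
package irdmpgrest

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

type RecordID struct {
	ID      string    `json:"id"`
	Updated time.Time `json:"updated"`
}

// NewIDsSince uses PostgREST's filter syntax (?updated=gt.<timestamp>) to
// retrieve only ids changed after the given time.
func NewIDsSince(pgrestURL string, since time.Time) ([]RecordID, error) {
	q := url.Values{}
	q.Set("updated", "gt."+since.Format(time.RFC3339))
	q.Set("select", "id,updated")
	resp, err := http.Get(fmt.Sprintf("%s/updated_record_ids?%s", pgrestURL, q.Encode()))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var ids []RecordID
	if err := json.NewDecoder(resp.Body).Decode(&ids); err != nil {
		return nil, err
	}
	return ids, nil
}
```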

Need to port the techniques I had figured out from feedtools to irdmtools' harvesters.

To quickly generate the documents to push to S3 I am going to implement the Newt router, which with the help of Postgres+PostgREST makes generating the aggregated lists easier and more consistent. I plan to use the Pandoc server microservice for rendering all the formats (e.g. BibTeX, RSS, JSON feeds).
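
A minimal sketch of the Pandoc rendering step, assuming Pandoc is running in server mode on localhost:3030 and accepts a JSON POST with text, from, and to fields (the exact request/response shape should be checked against the pandoc-server docs).

```go
// Sketch: render text via the Pandoc server microservice.
package irdmrender

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
)

// Render converts text from one format to another (e.g. "markdown" to
// "bibtex") by POSTing to a local Pandoc server.
func Render(text, from, to string) (string, error) {
	payload, err := json.Marshal(map[string]string{
		"text": text,
		"from": from,
		"to":   to,
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post("http://localhost:3030", "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(out), nil
}
```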

rsdoiel commented 1 year ago

I've done various experiments with a clone of CaltechDATA and CaltechAUTHORS (RDM) Postgres databases. A "schema" can be added to an existing Postgres database. Once we have a schema and permissions related to that schema we can use PostgREST as an alternative JSON API. This supports very fast object retrievals.
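
As an example of the kind of fast retrieval I mean, here is a sketch of fetching a single record's JSON through PostgREST; the view name ("records") and column name are placeholders, and the Accept header is PostgREST's way of asking for a single object instead of an array.

```go
// Sketch: single-object retrieval through PostgREST against the extra schema.
package irdmpgrest

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// GetRecord fetches one record's JSON by id, using PostgREST's
// ?id=eq.<value> filter and the single-object Accept header.
func GetRecord(pgrestURL, id string) (map[string]interface{}, error) {
	req, err := http.NewRequest("GET", fmt.Sprintf("%s/records?id=eq.%s", pgrestURL, id), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.pgrst.object+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var rec map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&rec); err != nil {
		return nil, err
	}
	return rec, nil
}
```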

Using a separate schema, and attaching any additional functions/views to that schema, will avoid collisions when RDM upgrades the database schema. In addition, for even more data safety we can replicate the database to a separate Postgres instance and work from that. There are lots of HA approaches with Postgres and replicated databases. A replicated database should trail close enough for near realtime updates, especially given the speed at which records are added or updated in CaltechAUTHORS generally.

Postgres+PostgREST+Newt also provides us opportunities for a rich reporting environment.