Permit the management of Collection Configs and Seed URLs via a Commons API

PeterCiuffetti commented 3 years ago

Overview

Assumptions

CoherenceBot is running on multiple EMR clusters in different AWS regions
CoherenceBot will have mechanisms to receive new collections to manage, and these collections will have one or more seed URLs feeding that collection
Collection activity can be added, updated or removed

Complications

Nutch's database contains info on URLs only. The management of collections will need a custom database to manage this info. Let's call this the "Collection Database" that is visible to CoherenceBot.
Nutch currently receives its seed URLs from a config file. There is an 'inject' phase that happens once in the first loop. I need to explore how to get this inject step to happen when new collections arrive.
We will need the ability to distribute the responsibility for a collection to a single EMR cluster. While this might be regionally assigned, a more equitable load balancing should be considered.
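One possible way to get an even spread without relying on geography is to hash the collection ID onto the list of clusters. This is only a sketch of that idea: the cluster names and the `assign_cluster` function are hypothetical, and the thread does not say that this scheme was adopted.

```python
import hashlib

# Hypothetical cluster names, used only for illustration.
CLUSTERS = ["africa", "asia", "americas"]


def assign_cluster(collection_id: int, clusters: list[str] = CLUSTERS) -> str:
    """Map a collection ID to exactly one cluster, identically on every call."""
    # A cryptographic hash gives a stable, evenly spread value;
    # Python's built-in hash() is not stable across processes for strings.
    digest = hashlib.sha256(str(collection_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]
```

A modulo scheme like this reshuffles most assignments when a cluster is added or removed, so it suits a fixed set of clusters better than a changing one.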
Should this be a 'pull' or a 'push' mechanism from CoherenceBot's perspective? Since CoherenceBot runs as a loop, it could, at the top of the loop, connect to the API and ask - "This is CoherenceBot Asia: do you have any new collection changes for me (add, delete, or other adjustments to crawl delays on a collection)?" and inject or delete seed URLs accordingly. This query would then happen several times a day in many cases, but probably not hourly. In any case, even if the Commons pushes new work into the collections database on its own, CoherenceBot will probably only be able to check for new work once per loop.
It's not clear how best to distribute the work to different regions. We can use country code, but then large countries like the US will need further breakdown. So let's also include State.
To handle multiple think tanks in a single university domain, the seed URLs assigned to a collection will be the basis for determining which other generated URLs belong to that collection. So there needs to be no ambiguity that https://a/b/c/d is assigned to a single collection. The full path structure of the seed URLs will be used as a prefix.
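The prefix rule above could be sketched as a longest-prefix match over the seed URLs. The function name `find_collection`, the `seeds` mapping and the example.edu URLs are assumptions for illustration, not CoherenceBot's actual code. Matching is done on path-segment boundaries so that a seed ending in `/policy` does not also claim `/policymaking`.

```python
from typing import Optional


def find_collection(url: str, seeds: dict[str, int]) -> Optional[int]:
    """Return the collection ID whose seed URL is the longest prefix of url."""
    best_base = None
    best_id = None
    for seed, collection_id in seeds.items():
        base = seed.rstrip("/")
        # Match the seed itself or anything below it in the path.
        if url == base or url.startswith(base + "/"):
            if best_base is None or len(base) > len(best_base):
                best_base, best_id = base, collection_id
    return best_id


# Hypothetical seeds: two collections inside one university domain.
seeds = {
    "https://example.edu/": 1,
    "https://example.edu/centers/policy/": 2,
}
```

With these seeds, a PDF under `/centers/policy/` resolves to collection 2, while every other URL on the domain falls back to collection 1.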
Collection creation is a complication. Again this probably needs to happen on a per-loop basis. At the bottom of loop 'n', CoherenceBot will export all PDFs selected in that iteration to an S3 collection file associated with that PDF's URL prefix. This could mean that a collection file stored on S3 could have a single PDF in it.

Proposed Implementation

CoherenceBot will override a class in the main looping mechanism that calls the Commons API to adjust its collection info and seed URL list.
CoherenceBot will need to know what country list it is responsible for and either announce this in the API call, or filter the results of the API response to only those collections it considers itself responsible for.
Factors in the API request could include: Region, Document Type. E.g. "I am the Africa CoherenceBot responsible for PDFs", or if it's doing its own filtering, it can just ask for all collection modifications since timestamp x (its last request for new work).
Details in the API response
- Collection ID
- Collection Name
- Seed URL array
- Org ID
- Org Name
- Org Country
- Org State
- Collection action: Add, Delete, Update
We may want a mechanism to confirm to the Commons which CoherenceBot cluster has taken responsibility for a collection. So either the collection database is exposed to external queries, or there needs to be another Commons API which can be used to update the Commons.
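As a rough sketch of how the proposed response details might be applied to a local collection store, assuming one record per collection change: the class name, field names and `apply_change` function below are hypothetical, and they follow the proposed list above rather than the API that was eventually delivered later in this thread.

```python
from dataclasses import dataclass


@dataclass
class CollectionChange:
    """One entry of the proposed API response (field names are assumptions)."""
    collection_id: int
    collection_name: str
    seed_urls: list[str]
    org_id: int
    org_name: str
    org_country: str
    org_state: str
    action: str  # "Add", "Delete", or "Update"


def apply_change(store: dict[int, CollectionChange], change: CollectionChange) -> None:
    """Apply a single change to an in-memory store keyed by collection ID."""
    if change.action in ("Add", "Update"):
        store[change.collection_id] = change
    elif change.action == "Delete":
        # Deleting an unknown collection is treated as a no-op.
        store.pop(change.collection_id, None)
    else:
        raise ValueError(f"unknown collection action: {change.action!r}")
```

Treating Add and Update the same way keeps the operation idempotent, which helps if the same change is received on more than one loop.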

PeterCiuffetti commented 3 years ago

I estimate this at about 4 days of work, most of it going into developing a local store of info that different plugins can consult when they need the org and collection information on the other side of this API, and into configuring the distribution capability into the solution.

avorio commented 3 years ago

The API is ready for you, @PeterCiuffetti :)

You have three available filters:

bucket
sourcing
organization__country

Example request:

https://policycommons.net/api/collections/?sourcing=coherencebot&bucket=extra

{
  "count": 62,
  "next": "https://policycommons.net/api/collections/?bucket=extra&page=2&sourcing=coherencebot",
  "previous": null,
  "results": [
    {
      "id": 1206,
      "title": "Publications",
      "slug": "publications",
      "url": "https://oceana.org/publications/",
      "sourcing": "coherencebot",
      "bucket": "extra",
      "org": {
        "slug": "oceana",
        "name": "Oceana",
        "acronym": null
      }
    },
    {
      "id": 1207,
      "title": "Reports",
      "slug": "reports",
      "url": "https://eciu.net/analysis/reports",
      "sourcing": "coherencebot",
      "bucket": "extra",
      "org": {
        "slug": "energy-and-climate-intelligence-unit",
        "name": "Energy and Climate Intelligence Unit",
        "acronym": "ECIU"
      }
    },

[...]

The authentication is done via x-api-key in the header, as per usual.

I've organised all of the CoherenceBot collections already ingested so far into two buckets:

coherencebot-batch-1 (109 collections)
coherencebot-batch-2 (144 collections)

You can retrieve them by using a combination of e.g. ?sourcing=coherencebot&bucket=coherencebot-batch-1. Mind you, this API endpoint returns all collections in the Commons, not only the ones for CoherenceBot. Hence, it's important to specify the sourcing attribute correctly.

I believe that a good starting point is the extra bucket. It has 62 collections today, none of which has already been ingested by CoherenceBot, but all of them have been properly connected to an organisation and vetted by either Toby or myself.

And finally, the admin, should you need to use it, is available here:

https://policycommons.net/admin/artifacts/collection/

You can use similar filters on the right-hand side.

PeterCiuffetti commented 3 years ago

Just playing with this a bit to familiarize myself with it...

this request

curl -v -H 'x-api-key: ...key...' 'https://policycommons.net/api/collections/?sourcing=coherencebot&bucket=extra'

It indicated count: 74 but returned only 10 results. Are there paginator params to get the next page(s)?

PeterCiuffetti commented 3 years ago

...seeing now the "next": and "previous": URLs at the top of the response. So, previous question is answered.
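The pagination seen in this thread (each response carries a "next" URL, which is null on the last page) can be followed with a small loop. This is a minimal sketch using only the Python standard library; the function names `fetch_page` and `iter_collections` are hypothetical, and the only API details assumed are the ones shown above: the `x-api-key` header and the `results` and `next` fields.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional


def fetch_page(url: str, api_key: str) -> dict:
    """GET one page of the collections API and decode the JSON body."""
    request = urllib.request.Request(url, headers={"x-api-key": api_key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_collections(first_url: str, get_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every collection, following "next" until it is null."""
    url: Optional[str] = first_url
    while url:
        page = get_page(url)
        yield from page["results"]
        url = page.get("next")


# Example use (needs a real key, so it is left commented out):
# start = "https://policycommons.net/api/collections/?sourcing=coherencebot&bucket=extra"
# for collection in iter_collections(start, lambda u: fetch_page(u, "...key...")):
#     print(collection["id"], collection["url"])
```

Passing the page fetcher in as a function keeps the paging loop independent of the HTTP call, so it can be exercised with canned pages.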

PeterCiuffetti commented 3 years ago

This is now finished, tested and deployed on all three clusters.

The solution uses a custom injector class called FeedInjector.

It is called by crontab (local for user hadoop). This is currently set up to run every hour at the top of the hour.

It uses the params ?sourcing=coherencebot&cluster=

coherentdigital / coherencebot