coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Permit the management of Collection Configs and Seed URLs via a Commons API #8

Closed PeterCiuffetti closed 3 years ago

PeterCiuffetti commented 3 years ago

Overview

Assumptions

Complications

Proposed Implementation

PeterCiuffetti commented 3 years ago

I estimate this at about 4 days of work. Most of that goes into developing a local store that the different plugins can consult for the org and collection information on the other side of this API, and into configuring the distribution capability into the solution.

avorio commented 3 years ago

The API is ready for you, @PeterCiuffetti :)

You have three available filters:

Example request:

https://policycommons.net/api/collections/?sourcing=coherencebot&bucket=extra
{
  "count": 62,
  "next": "https://policycommons.net/api/collections/?bucket=extra&page=2&sourcing=coherencebot",
  "previous": null,
  "results": [
    {
      "id": 1206,
      "title": "Publications",
      "slug": "publications",
      "url": "https://oceana.org/publications/",
      "sourcing": "coherencebot",
      "bucket": "extra",
      "org": {
        "slug": "oceana",
        "name": "Oceana",
        "acronym": null
      }
    },
    {
      "id": 1207,
      "title": "Reports",
      "slug": "reports",
      "url": "https://eciu.net/analysis/reports",
      "sourcing": "coherencebot",
      "bucket": "extra",
      "org": {
        "slug": "energy-and-climate-intelligence-unit",
        "name": "Energy and Climate Intelligence Unit",
        "acronym": "ECIU"
      }
    },

[...]

The authentication is done via x-api-key in the header, as per usual.

I've organised all of the CoherenceBot collections already ingested so far into two buckets:

You can retrieve them by using a combination of e.g. ?sourcing=coherencebot&bucket=coherencebot-batch-1. Mind you, this API endpoint returns all collections in the Commons, not only the ones for CoherenceBot. Hence, it's important to specify the sourcing attribute correctly.
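As a minimal sketch of the filter combination described above, the following Python builds the query URL and the request headers. The helper name `build_collections_request` is hypothetical; only the endpoint, the `sourcing` and `bucket` parameters, and the `x-api-key` header come from this thread.

```python
from urllib.parse import urlencode

API_URL = "https://policycommons.net/api/collections/"


def build_collections_request(api_key, bucket, sourcing="coherencebot"):
    """Return (url, headers) for a filtered collections request.

    Hypothetical helper for illustration, not part of any existing client.
    """
    # Always send `sourcing`: without it the endpoint is not limited
    # to CoherenceBot collections.
    query = urlencode({"sourcing": sourcing, "bucket": bucket})
    return f"{API_URL}?{query}", {"x-api-key": api_key}


url, headers = build_collections_request("...key...", "coherencebot-batch-1")
print(url)
```

Defaulting `sourcing` to `coherencebot` is a design choice in this sketch, so that a caller cannot accidentally omit it.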

I believe that a good starting point is the extra bucket. It has 62 collections today, none of which has already been ingested by CoherenceBot, but all of them have been properly connected to an organisation and vetted by either Toby or myself.

And finally, the admin, should you need to use it, is available here:

https://policycommons.net/admin/artifacts/collection/

You can use similar filters on the right-hand side.

PeterCiuffetti commented 3 years ago

Just playing with this a bit to familiarize myself with it...

This request:

curl -v -H 'x-api-key: ...key...' 'https://policycommons.net/api/collections/?sourcing=coherencebot&bucket=extra'

It indicated "count": 74 but returned only 10 results. Are there pagination parameters to get the next page(s)?

PeterCiuffetti commented 3 years ago

...I now see the "next" and "previous" URLs at the top of the response body. So my previous question is answered.
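For reference, walking the pages can be sketched in Python as below. This is an illustration under assumptions, not the code that was deployed: the function names `fetch_page` and `iter_collections` are hypothetical, and the only things taken from this thread are the `x-api-key` header and the `results` and `next` fields of the response.

```python
import json
from urllib.request import Request, urlopen


def fetch_page(url, api_key):
    """GET one page of the API and decode its JSON body."""
    request = Request(url, headers={"x-api-key": api_key})
    with urlopen(request) as response:
        return json.load(response)


def iter_collections(first_url, get_page):
    """Yield every collection, following "next" until it is null.

    `get_page` is any callable that maps a URL to a decoded page,
    which keeps the loop testable without network access.
    """
    url = first_url
    while url:
        page = get_page(url)
        yield from page["results"]
        url = page.get("next")  # None on the last page ends the loop
```

A caller would pass something like `lambda u: fetch_page(u, api_key)` as `get_page`, with the first URL carrying the `sourcing` and `bucket` filters.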

PeterCiuffetti commented 3 years ago

This is now finished, tested and deployed on all three clusters.

The solution uses a custom injector class called FeedInjector.

It is invoked from the crontab of the local hadoop user, which is currently set up to run it every hour, at the top of the hour.

It uses the params ?sourcing=coherencebot&cluster=