OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2

Define a federated Open Terms Archive collections APIs #1016

Closed Ndpnt closed 10 months ago

Ndpnt commented 11 months ago

Context and Problem Statement

Open Terms Archive is a decentralised system that tracks collections of services and documents across multiple servers. Each collection operates its own API, which exposes the services and terms it tracks. Because these APIs are decentralised, finding out which services and documents are currently tracked means querying every one of them.

We propose the creation of a federated API to enable easy querying of the distributed database and thus facilitate collaboration with external applications.

Proposed solution

Base URL

http://api.opentermsarchive.org/:version

Endpoints

Note: The failures object is detailed below in the Error Handling section

GET /collections

Enumerate all collections

#### Returns

A JSON array of all collections

#### Example

```
GET /collections
```

```json
[
  {
    "id": "collection-1",
    "name": "Collections 1",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Online intermediation services for businesses subject to the European platforms-to-businesses (“P2B” / 2019/1150) regulation",
      "fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen P2B / 2019/1150"
    },
    "url": "162.162.162.162",
    "maintainers": [
      {
        "name": "Open Evidence",
        "url": "https://open-evidence.com/"
      },
      {
        "name": "European Commission",
        "url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
      }
    ]
  },
  {
    "id": "collection-2",
    "name": "Collections 2",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Services needed to operate the Open Terms Archive engine",
      "fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
    },
    "url": "162.162.162.162",
    "maintainers": [
      {
        "name": "Open Terms Archive",
        "url": "https://opentermsarchive.org"
      }
    ]
  }
]
```

GET /services?searchName=:searchName

#### Parameters

| Parameter | Type | Description |
| --------- | ------ | ---------------------- |
| searchName | URL-encoded string | The string to search for in service names |

#### Returns

A JSON array of all matching services across all collections with the URL where they can be found. Returns all services if no `searchName` param is passed. Returns an empty array if no matching service is found.

#### Example

```
GET /services?searchName=tube
```

```json
{
  "results": [
    {
      "collection": "demo",
      "service": {
        "id": "peartube",
        "name": "PEARTUBE",
        "url": "http://173.173.173.173/api/v1/service/peartube",
        "termsTypes": ["Terms of Service"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "yourtube",
        "name": "YourTube",
        "url": "http://162.162.162.162/api/v1/service/yourtube",
        "termsTypes": ["Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}
```

GET /service/:serviceId

Retrieve all services with the given ID across all collections

#### Parameters

| Parameter | Type | Description |
| --------- | ------ | ---------------------- |
| serviceId | URL-encoded string | The ID of the service. |

#### Returns

A JSON array of services with the given ID across all collections with the URL where they can be found. Returns an HTTP `404` if no matching service is found.

#### Example

```
GET /service/service1
```

```json
{
  "results": [
    {
      "collection": "demo",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://173.173.173.173/api/v1/service/service1",
        "termsTypes": ["Terms of Service"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://162.162.162.162/api/v1/service/service1",
        "termsTypes": ["Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}
```

Notes

Duplicates

We have considered multiple duplicate resolution solutions (specifying priority order as query params, defining an arbitrary priority based on data quality, returning an arbitrary result with a key alternatives to other results, using HTTP code 300 Multiple Choices, …) but we have come to the conclusion that they do not align with our fundamental philosophy of decentralization and resilience. The idea is therefore to embrace the fact that it is possible to have the same service declared in multiple collections and thus to always return an array of results.

Error Handling

To handle errors in the underlying APIs, the idea is to return a failures array containing objects describing the collection that failed and why. For example:

{
  "results": [
    …
  ],
  "failures": [
    {
      "collection": "demo",
      "message": "The API service encountered an internal error while processing the request."
    },
    {
      "collection": "contrib",
      "message": "The API is currently unreachable."
    }
  ]
}
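
As a rough illustration of how a federated server could assemble such a response, here is a minimal sketch. It assumes a hypothetical `fetch_services` callable that returns the list of services of one collection and raises on any error; the name `federate` and the use of the exception text as the failure message are illustrative choices, not part of this RFC.

```python
from typing import Any, Callable

Collection = dict[str, Any]


def federate(
    collections: list[Collection],
    fetch_services: Callable[[str], list[dict[str, Any]]],
) -> dict[str, list[dict[str, Any]]]:
    """Query every collection and keep going when one of them fails."""
    results = []
    failures = []
    for collection in collections:
        try:
            # fetch_services is hypothetical: it receives the collection base URL.
            services = fetch_services(collection["url"])
        except Exception as error:
            # A failing collection is reported in "failures" instead of aborting.
            failures.append({"collection": collection["id"], "message": str(error)})
            continue
        for service in services:
            results.append({"collection": collection["id"], "service": service})
    return {"results": results, "failures": failures}
```

With this shape, a single unreachable collection only adds an entry to `failures`; results from the other collections are still returned.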

Compatibility with different underlying API versions

By definition, a federated API may interact with multiple versions of underlying APIs. To effectively manage this, the proposed approach is to only gather the necessary fields and directly provide the resource URL in the underlying API. Moreover, to allow the client to determine the shape of the result, it is proposed to include the API version in the response headers of each underlying API.

Naming convention for collection ID

As the collection ID will then become a differentiating element that should be easy to handle with scripts and other tools, we suggest the following naming convention:

madoleary commented 11 months ago

I have a note about duplicates: I think I agree that returning all results is the best way to go, but that still leaves the question of how we'd handle duplicates on the ToS;DR side. The RFC mentions "defining an arbitrary priority based on data quality" -- what is the criteria for "data quality" in this case? Does this mean that the result with the "highest" data quality would be returned?

Is there a real-life example of duplicates that I could inspect, just to see what the returned data might look like?

Thank you!

Ndpnt commented 11 months ago

Hi @madoleary,

> I have a note about duplicates: I think I agree that returning all results is the best way to go, but that still leaves the question of how we'd handle duplicates on the ToS;DR side.

The idea is to leave each client of the federated API responsible for handling duplicates: the API returns all the results and the client chooses the collection from which it wants to obtain the document. I think Open Terms Archive does not aim to be an intermediary that makes crucial choices for federated API clients, such as which collection should be more reliable than another.

> The RFC mentions "defining an arbitrary priority based on data quality" -- what is the criteria for "data quality" in this case? Does this mean that the result with the "highest" data quality would be returned?

As mentioned in the RFC, the idea of "defining an arbitrary priority based on data quality" was not retained, so a priori the question of data quality criteria will not be addressed on the OTA side.

> Is there a real-life example of duplicates that I could inspect, just to see what the returned data might look like?

For example, a result for a query like GET /service/facebook could look like this:

{
  "results": [
    {
      "collection": "pga",
      "service": {
        "id": "facebook",
        "name": "Facebook",
        "url": "http://173.173.173.173/api/v1/service/facebook",
        "termsTypes": [ "Terms of Service", "Privacy Policy", "Developer Terms", "Trackers Policy", "Data Processor Agreement"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "facebook",
        "name": "Facebook",
        "url": "http://162.162.162.162/api/v1/service/facebook",
        "termsTypes": [ "Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}

And on your side, you could define that you prefer to use data from the pga collection because this collection is dedicated to tracking only gatekeepers with a high quality of maintenance whereas the contrib collection has no clearly defined maintainers. Another element of choice for you could be that the pga collection has more types of terms tracked for the Facebook service. It's up to you 🙂.
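
For illustration only, such a client-side choice could be sketched as follows. The function name `pick_preferred` and the ranking rule (explicit collection preference first, then the number of tracked terms types) are assumptions made up for this example; each client is free to define its own.

```python
from typing import Any, Optional


def pick_preferred(
    response: dict[str, Any],
    preferred_collections: list[str],
) -> Optional[dict[str, Any]]:
    """Pick one entry from "results" according to the client's own policy."""
    entries = response["results"]
    if not entries:
        return None
    # Collections listed first are preferred; unlisted ones come last.
    rank = {collection_id: i for i, collection_id in enumerate(preferred_collections)}

    def sort_key(entry: dict[str, Any]) -> tuple[int, int]:
        collection_rank = rank.get(entry["collection"], len(rank))
        # Tie-breaker: prefer the declaration tracking more terms types.
        return (collection_rank, -len(entry["service"]["termsTypes"]))

    return min(entries, key=sort_key)
```

Called with `["pga"]` on the response above, this returns the `pga` entry; called with `["contrib"]`, it returns the `contrib` one.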

madoleary commented 11 months ago

Very helpful, thank you, @Ndpnt !

MattiSG commented 11 months ago

Thanks @Ndpnt for this clear RFC!

Proposition 1.B

This is a suggested improvement of proposition 1 (initially posted) on GET /collections.

GET /collections

The provided url examples are just a hostname (162.162.162.162). I believe they should be full-fledged URLs to the base endpoint of the API (http://162.162.162.162/api) so that API calls can be programmatically written. We should also specify in the spec that it has no trailing slash.

```diff
 [
   {
     "id": "collection-1",
     "name": "Collections 1",
     "languages": ["en"],
     "jurisdictions": ["EU"],
     "industries": {
       "en": "Online intermediation services for businesses subject to the European platforms-to-businesses (“P2B” / 2019/1150) regulation",
       "fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen P2B / 2019/1150"
     },
-    "url": "162.162.162.162",
+    "url": "http://162.162.162.162/api",
     "maintainers": [
       {
         "name": "Open Evidence",
         "url": "https://open-evidence.com/"
       },
       {
         "name": "European Commission",
         "url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
       }
     ]
   },
   {
     "id": "collection-2",
     "name": "Collections 2",
     "languages": ["en"],
     "jurisdictions": ["EU"],
     "industries": {
       "en": "Services needed to operate the Open Terms Archive engine",
       "fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
     },
-    "url": "162.162.162.162",
+    "url": "https://api.ota.openmirrors.example/arbitrary/long/path",
     "maintainers": [
       {
         "name": "Open Terms Archive",
         "url": "https://opentermsarchive.org"
       }
     ]
   }
 ]
```
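
To show why a full base URL without a trailing slash helps, here is a small sketch of how a client might derive a service URL from a collection's `url` field. The `/:version/service/:serviceId` layout is inferred from the example URLs in this thread, and the function name `service_url` is hypothetical.

```python
from urllib.parse import quote


def service_url(base_url: str, version: str, service_id: str) -> str:
    """Build the URL of a service in one collection's API.

    Assumes base_url has no trailing slash, as suggested above.
    """
    # safe="" makes sure characters such as "/" in an ID are percent-encoded too.
    return f"{base_url}/{version}/service/{quote(service_id, safe='')}"


print(service_url("http://162.162.162.162/api", "v1", "peartube"))
# http://162.162.162.162/api/v1/service/peartube
```

Because the base URL is guaranteed not to end with `/`, plain concatenation is enough and no normalisation step is needed on the client.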

Proposition 2

This is an alternative to proposition 1 (initially posted) on GET /services?searchName=:searchName

GET /services/search?name=:searchName

My rationale is to prefer a /services/search route with a ?name query string, as this feels more future-proof with regard to other routes we may add later: we don't reserve query parameters at the /services level, and we avoid repeating search in query parameter names if, for example, we add support for searching by ID or for fuzzy search in the future.

#### Parameters

```diff
  | Parameter | Type | Description |
  | --------- | ------ | ---------------------- |
- | searchName | URL-encoded string | The string to search for in service names |
+ | name | URL-encoded string | The string to search for in service names |
```

#### Returns

A JSON array of all matching services across all collections with the URL where they can be found. Returns all services if no `name` param is passed. Returns an empty array if no matching service is found.

#### Example

```diff
- GET /services?searchName=tube
+ GET /services/search?name=tube
```

```diff
 {
   "results": [
     {
       "collection": "demo",
       "service": {
         "id": "peartube",
         "name": "PEARTUBE",
-        "url": "http://173.173.173.173/api/v1/service/peartube",
+        "url": "http://162.162.162.162/api/v1/service/peartube",
         "termsTypes": ["Terms of Service"]
       }
     },
     {
       "collection": "contrib",
       "service": {
         "id": "yourtube",
         "name": "YourTube",
-        "url": "http://162.162.162.162/api/v1/service/yourtube",
+        "url": "https://api.ota.openmirrors.example/arbitrary/long/path/v1/service/yourtube",
         "termsTypes": ["Terms of Service", "Privacy Policy"]
       }
     }
   ],
   "failures": []
 }
```

madoleary commented 11 months ago

I think that the ?name query string is a good suggestion

madoleary commented 11 months ago

I have another question: what would the response object look like for an index of services? For example, if I were to retrieve all the services for each collection. I ask this because eventually Phoenix is supposed to retrieve an index of services from OTA, per the MOU. Let me know if this question is outside the scope of this RFC.

madoleary commented 11 months ago

Also: is there a specific message returned when a service is not found?

madoleary commented 11 months ago

> Also: is there a specific message returned when a service is not found?

Sorry, I see the HTTP 404 note!

Ndpnt commented 11 months ago

Thanks @MattiSG for your propositions.

I fully agree with the Proposition 1.B.

For proposition 2:

Ndpnt commented 11 months ago

> I have another question: what would the response object look like for an index of services? For example, if I were to retrieve all the services for each collection. I ask this because eventually Phoenix is supposed to retrieve an index of services from OTA, per the MOU. Let me know if this question is outside the scope of this RFC.

As I suggest treating search as nothing more than a filter on the services collection, for me the response object will look exactly the same. And if you need to retrieve all the services for each collection, we could add a collection query string to allow filtering on the collection ID as well.

madoleary commented 11 months ago

Proposition 3

This is a suggested improvement on proposition one GET /services?name=:searchName, initially posted as GET /services?searchName=:searchName.

GET /services?name=:searchName&termsType=:termsType

The idea is to add the ability to query by termsType, so that the results can be filtered by both service name and terms type. This is to avoid having to iterate through all service results and verify their termsTypes fields at each iteration, just to locate a specific terms type within a specific service.

Details
**Parameters**

| Parameter | Type | Description |
| --------- | ------ | ---------------------- |
| name | URL-encoded string | The string to search for in service names |
| termsType | URL-encoded string | The string to search for in service terms |

**Returns**

A JSON array of all matching services across all collections that also include the terms type, as indicated by the `termsType` query param, in their `termsTypes` fields. Returns all matching services if no `termsType` param is passed. Returns an empty array if no matching service with the terms type is found.

**Example**

`GET /services?name=facebook&termsType=cookies%20policy`

```
{
  "results": [
    {
      "collection": "contrib",
      "service": {
        "id": "facebook",
        "name": "Facebook",
        "url": "http://162.162.162.162/api/v1/service/facebook",
        "termsTypes": ["Terms of Service", "Cookies Policy"]
      }
    }
  ],
  "failures": []
}
```

Ndpnt commented 11 months ago

Hi @madoleary, thanks for your proposition 3. I would make one minor change: allowing multiple terms types to be given, like this:

Proposition 3.B

GET /services?name=:searchName&termsTypes=:termsType1,termsType2

Details
**Parameters**

| Parameter | Type | Description |
| --------- | ------ | ---------------------- |
| name | URL-encoded string | The string to search for in service names |
| termsTypes | URL-encoded string | The comma-separated string that represents the array of terms types to search for |

**Returns**

A JSON array of all matching services across all collections that also include the terms types, as indicated by the `termsTypes` query param, in their `termsTypes` fields. Returns all matching services if no `termsTypes` param is passed. Returns an empty array if no matching service with the terms types is found.

**Example**

`GET /services?name=facebook&termsTypes=Cookies%20Policy,Terms%20of%20Service`

```
{
  "results": [
    {
      "collection": "contrib",
      "service": {
        "id": "facebook",
        "name": "Facebook",
        "url": "http://162.162.162.162/api/v1/service/facebook",
        "termsTypes": ["Terms of Service", "Cookies Policy"]
      }
    }
  ],
  "failures": []
}
```

madoleary commented 11 months ago

That looks great, @Ndpnt ! I'm in favor of proposition 3.B

MattiSG commented 11 months ago

Love it!

> I think it is more RESTful to think: "There is a services collection where I apply some filters"

💯

Thank you both for your contributions, I fully support 3.B!

Ndpnt commented 10 months ago

Hi everyone,

This RFC has received no further feedback for a month, so I think we can conclude that proposal 3.B seems acceptable to everyone and will therefore be implemented.

Thanks again for your contributions 🙏 .

Please note that we will probably not be able to work on its implementation for a few weeks, as we have a lot of things to handle this month.

MattiSG commented 10 months ago

Thanks @Ndpnt!

It's not entirely clear to me what will be implemented: 3.B is concerned with GET /services?name=:searchName&termsTypes=:termsType1,termsType2. What about GET /service/:serviceId (proposition 2? With your further amendments?) and GET /collections (1 or 1.B?)? 🤔 What is the final proposed API layout?

Ndpnt commented 10 months ago

Proposed final API layout:

GET /collections

#### Returns

A JSON array of all collections

#### Example

```
GET /collections
```

```json
[
  {
    "id": "collection-1",
    "name": "Collections 1",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Online intermediation services for businesses subject to the European platforms-to-businesses (“P2B” / 2019/1150) regulation",
      "fr": "Services d’intermédiation en ligne pour les entreprises sujets au règlement européen P2B / 2019/1150"
    },
    "url": "http://162.162.162.162/api",
    "maintainers": [
      {
        "name": "Open Evidence",
        "url": "https://open-evidence.com/"
      },
      {
        "name": "European Commission",
        "url": "https://ec.europa.eu/info/departments/communications-networks-content-and-technology_en"
      }
    ]
  },
  {
    "id": "collection-2",
    "name": "Collections 2",
    "languages": ["en"],
    "jurisdictions": ["EU"],
    "industries": {
      "en": "Services needed to operate the Open Terms Archive engine",
      "fr": "Services nécessaires au fonctionnement du moteur d'Open Terms Archive"
    },
    "url": "https://api.ota.openmirrors.example/arbitrary/long/path",
    "maintainers": [
      {
        "name": "Open Terms Archive",
        "url": "https://opentermsarchive.org"
      }
    ]
  }
]
```

GET /services?name=:searchName&termsTypes=:termsType1,termsType2

Details
**Parameters**

| Parameter | Type | Description |
| --------- | ------ | ---------------------- |
| name | URL-encoded string | The string to search for in service names |
| termsTypes | URL-encoded string | The comma-separated string that represents the array of terms types to search for |

**Returns**

A JSON array of all matching services across all collections that also include the terms types, as indicated by the `termsTypes` query param, in their `termsTypes` fields. Returns all matching services if no `termsTypes` param is passed. Returns an empty array if no matching service with the terms types is found.

**Example**

`GET /services?name=facebook&termsTypes=Cookies%20Policy,Terms%20of%20Service`

```
{
  "results": [
    {
      "collection": "contrib",
      "service": {
        "id": "facebook",
        "name": "Facebook",
        "url": "http://162.162.162.162/api/v1/service/facebook",
        "termsTypes": ["Terms of Service", "Cookies Policy"]
      }
    }
  ],
  "failures": []
}
```

GET /service/:serviceId

#### Parameters

| Parameter | Type | Description |
| --------- | ------ | ---------------------- |
| serviceId | URL-encoded string | The ID of the service. |

#### Returns

A JSON array of services with the given ID across all collections with the URL where they can be found. Returns an HTTP `404` if no matching service is found.

#### Example

```
GET /service/service1
```

```json
{
  "results": [
    {
      "collection": "demo",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://173.173.173.173/api/v1/service/service1",
        "termsTypes": ["Terms of Service"]
      }
    },
    {
      "collection": "contrib",
      "service": {
        "id": "service1",
        "name": "Service 1",
        "url": "http://162.162.162.162/api/v1/service/service1",
        "termsTypes": ["Terms of Service", "Privacy Policy"]
      }
    }
  ],
  "failures": []
}
```

MattiSG commented 10 months ago

Much clearer, thank you very much! 😃

MattiSG commented 7 months ago

In 3.B (https://github.com/OpenTermsArchive/engine/issues/1016#issuecomment-1658268448), we did not specify if specifying multiple terms types means we want to get only the service declarations that track all those terms types, or if we want to get all service declarations that track at least one of those terms types 🙃

@Ndpnt you were the one expanding on @madoleary’s initial request, to include multiple terms types. Do you remember what was your intention with this addition?

MattiSG commented 7 months ago

We also did not specify what happens if /services is called with no parameter at all. I suggest it sends a 400 Bad Request error, as we don't want the federated API to proceed with aggregating every existing declaration.

Ndpnt commented 7 months ago

> In 3.B (#1016 (comment)), we did not specify if specifying multiple terms types means we want to get only the service declarations that track all those terms types, or if we want to get all service declarations that track at least one of those terms types 🙃
>
> @Ndpnt you were the one expanding on @madoleary’s initial request, to include multiple terms types. Do you remember what was your intention with this addition?

My intention was to make it possible to search for a service containing at least the specified terms types, in order to help me find the most appropriate collection for the terms types I was interested in. So for me, it was an AND logical operator for terms types.

Ndpnt commented 7 months ago

> We also did not specify what happens if /services is called with no parameter at all. I suggest it sends a 400 Bad Request error, as we don't want the federated API to proceed with aggregating every existing declaration.

I don't agree with that, I'm in favor of returning all the services. At the moment, we don't have too many services, and when we do, we'll be able to set up pagination. It's important to bear in mind that this means just one request to each collection API and not a request per service.

Ndpnt commented 7 months ago

After some discussion, it seems that we don't currently have a use case for searching with multiple term types on /services, so we'll revert to a single termsType parameter.
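
As a sketch of what client code for the resulting route might look like, the helper below builds the request path with optional `name` and `termsType` filters. The function name `services_query` is hypothetical; the parameter names are the ones discussed in this thread.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def services_query(name: Optional[str] = None, terms_type: Optional[str] = None) -> str:
    """Build the path and query string for the federated /services route."""
    params = {}
    if name:
        params["name"] = name
    if terms_type:
        params["termsType"] = terms_type
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode(params, quote_via=quote)
    return "/services" + (f"?{query}" if query else "")


print(services_query("facebook", "Cookies Policy"))
# /services?name=facebook&termsType=Cookies%20Policy
```

Calling it with no argument yields plain `/services`, the route without any filter.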

MattiSG commented 7 months ago

If we have no results but all collections have failures, is that still a 404 or is that a 502 at some point? 🤔

MattiSG commented 7 months ago

> it seems that we don't currently have a use case for searching with multiple term types

Complementary note: we also found that all hypothetical use cases (AND, OR) could be implemented with the basic function provided here and a tiny bit of client-side logic. There will always be time to add more power to the API later on, once we have a better understanding of the most common use cases 🙂
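
To make that concrete, here is one possible sketch of the client-side logic. It assumes a hypothetical `search` callable that performs one `termsType` query and returns the parsed response; the function name, the `mode` argument, and the choice to ignore `failures` are simplifications made for this example.

```python
from typing import Any, Callable


def search_terms_types(
    search: Callable[[str], dict[str, Any]],
    terms_types: list[str],
    mode: str = "all",
) -> list[dict[str, Any]]:
    """Combine single-termsType queries with AND ("all") or OR ("any")."""
    per_type = []
    for terms_type in terms_types:
        response = search(terms_type)
        # Index entries by (collection, service ID) so they can be compared.
        per_type.append(
            {(e["collection"], e["service"]["id"]): e for e in response["results"]}
        )
    if not per_type:
        return []
    keys = set(per_type[0])
    for found in per_type[1:]:
        keys = keys & set(found) if mode == "all" else keys | set(found)
    merged: dict[tuple[str, str], dict[str, Any]] = {}
    for found in per_type:
        for key, entry in found.items():
            if key in keys:
                merged.setdefault(key, entry)
    return list(merged.values())
```

The cost is one federated request per terms type, which stays small as long as clients only combine a handful of types.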

MattiSG commented 7 months ago

> I don't agree with that, I'm in favor of returning all the services. At the moment, we don't have too many services

After discussion I agree, this was premature optimisation on my side. This “no parameter” route is very easy to cache. If it becomes very popular and the contents grow big, we can just decrease the poll rate and warn that this route only updates every hour / every day…
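
One simple way to picture that caching is a time-based cache in front of the aggregation, sketched below. The class name `CachedRoute`, the `compute` callable and the injectable `clock` are all assumptions made for this example, not a description of how the federated API is implemented.

```python
import time
from typing import Any, Callable

_EMPTY = object()  # sentinel meaning "nothing cached yet"


class CachedRoute:
    """Serve a cached response and recompute it at most once per TTL."""

    def __init__(
        self,
        compute: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = _EMPTY
        self._stored_at = 0.0

    def get(self) -> Any:
        now = self._clock()
        # Recompute on first use or once the cached value is older than the TTL.
        if self._value is _EMPTY or now - self._stored_at >= self._ttl:
            self._value = self._compute()
            self._stored_at = now
        return self._value
```

Raising `ttl_seconds` from an hour to a day is then the only change needed to lower the poll rate on the underlying collections.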

madoleary commented 7 months ago

Hi all, I appreciate the discussion about multiple terms types. In my specs, I only have us searching for one terms type at a time, e.g., cookies policy. I, too, don't think searching for multiple terms types is necessary. I also think all services should be returned on /services. I think that's more like the RESTful behavior I've seen.