OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2

Define a collection metadata API to facilitate collaboration with external applications #1003

Closed by Ndpnt 1 year ago

Ndpnt commented 1 year ago

Context and Problem Statement

Open Terms Archive (OTA) is a decentralised system that tracks collections of services and documents across multiple servers. Each collection has its own public repository where service and document declarations are stored. This decentralisation makes it hard to identify which services and documents are currently being tracked.

This can complicate collaborative efforts with external applications, such as Terms of Service; Didn't Read (ToS;DR), whose web application will be adapted to obtain data from public OTA datasets instead of the ToS;DR server database. When users of the application attempt to add a new document, the system must be able to tell them whether the document already exists in an OTA collection and, if so, in which one.

To address this problem, we propose the creation of an API that allows easy access to the metadata of each OTA collection and thus facilitates collaboration with external applications.

This RFC outlines the details of the proposed collection metadata API.

Proposed solution: Collection metadata API

Base URL

<collection host>/api/:version

Endpoints

GET /services

Retrieve all services, with optional query parameters

#### Query parameters

| Parameter  | Type   | Description                                                      |
| ---------- | ------ | ---------------------------------------------------------------- |
| name       | string | The name of the service to fuzzy search                          |
| terms_type | string | The type of the terms to search for within the services' terms.  |

#### Returns

A JSON array of all services including all their terms. An empty array if no services match the provided query parameters.

#### Example

```
GET /services
```

```json
[
  {
    "id": "service1",
    "name": "Service 1",
    "terms": [
      { "type": "Terms of Service" },
      { "type": "Privacy Policy" }
    ]
  },
  {
    "id": "service2",
    "name": "Service 2",
    "terms": [
      { "type": "Terms of Service" }
    ]
  }
]
```

#### Another example with query param

```
GET /services?name=Service%201
```

```json
[
  {
    "id": "service1",
    "name": "Service 1",
    "terms": [
      { "type": "Terms of Service" },
      { "type": "Privacy Policy" }
    ]
  }
]
```

GET /services/:serviceId

Retrieve a specific service by ID

#### Parameters

| Parameter | Type   | Description            |
| --------- | ------ | ---------------------- |
| serviceId | string | The ID of the service. |

#### Returns

A JSON object representing the service with the specified ID, including its terms. Returns an HTTP `404` if no service is found.

#### Example

```
GET /services/service1
```

```json
{
  "id": "service1",
  "name": "Service 1",
  "terms": [
    { "type": "Terms of Service" },
    { "type": "Privacy Policy" }
  ]
}
```

GET /services/:serviceId/terms

Retrieve all terms included in the specified service

#### Parameters

| Parameter | Type   | Description            |
| --------- | ------ | ---------------------- |
| serviceId | string | The ID of the service. |

#### Returns

A JSON array of all terms included in the specified service. Returns an HTTP `404` if no service is found.

#### Example

```
GET /services/service1/terms
```

```json
[
  { "type": "Terms of Service" },
  { "type": "Privacy Policy" }
]
```

GET /services/:serviceId/terms/:termsType

Retrieve a specific terms within a specific service by its type

#### Parameters

| Parameter | Type   | Description            |
| --------- | ------ | ---------------------- |
| serviceId | string | The ID of the service. |
| termsType | string | The terms type.        |

#### Returns

A JSON object representing the terms. Returns an HTTP `404` if no service or terms is found.

#### Example

```
GET /services/service1/terms/Privacy%20Policy
```

```json
{ "type": "Privacy Policy" }
```

Note

Here is an example with the location of each source document that constitutes a terms:

```
GET /services/service1
```

```json
{
  "id": "service1",
  "name": "Service 1",
  "terms": [
    {
      "type": "Terms of Service",
      "sourceDocuments": [
        {
          "location": "https://service1.com/tos-1"
        },
        {
          "location": "https://service1.com/tos-2"
        }
      ]
    },
    {
      "type": "Privacy Policy",
      "sourceDocuments": [
        {
          "location": "https://service1.com/privacy-policy"
        }
      ]
    }
  ]
}
```
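
For the use case in the problem statement (telling a user whether a document is already tracked), a consumer could scan a response shaped like the one above. This is only a minimal sketch: it assumes the `sourceDocuments` and `location` attributes end up in the response, and `find_terms_tracking` is a hypothetical helper name, not something the engine provides.

```python
def find_terms_tracking(service, location):
    """Return the types of the terms whose source documents include `location`.

    `service` is a dict shaped like the example response; `sourceDocuments`
    is assumed to be present (it is only a candidate attribute in this RFC).
    """
    return [
        terms["type"]
        for terms in service.get("terms", [])
        if any(
            doc.get("location") == location
            for doc in terms.get("sourceDocuments", [])
        )
    ]


service = {
    "id": "service1",
    "name": "Service 1",
    "terms": [
        {
            "type": "Terms of Service",
            "sourceDocuments": [
                {"location": "https://service1.com/tos-1"},
                {"location": "https://service1.com/tos-2"},
            ],
        },
        {
            "type": "Privacy Policy",
            "sourceDocuments": [
                {"location": "https://service1.com/privacy-policy"},
            ],
        },
    ],
}

print(find_terms_tracking(service, "https://service1.com/tos-2"))    # ['Terms of Service']
print(find_terms_tracking(service, "https://service1.com/unknown"))  # []
```

An exact string comparison is the simplest choice here; a real consumer would probably want to canonicalise URLs first, which is outside the scope of this RFC.
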
Kissaki commented 1 year ago

The GET /services/service1 example is different from the closing GET /services/service1 example at the end. I'm a bit confused about whether all the endpoints really return an array of objects with key type and string value?

Is sourceDocuments omitted from the other endpoints? Only from the examples? Omitted when empty?


Is there even a need for sub-endpoints when /services/:serviceId data is (presumably) small enough that it could always serve all terms with type and doc locations? Omitting them could simplify the interface and reduce upcoming maintenance if deemed viable. For consumers it'd mean less structure parsing considerations too.

If deemed useful, query parameters could serve for filtering (?termtype=Privacy Policy), replacing the sub-endpoints of this proposal and their different result structures.

Ndpnt commented 1 year ago

Hi @Kissaki and thank you for taking the time to participate in this RFC 😃

> The GET /services/service1 example is different from the closing GET /services/service1 example at the end. I'm a bit confused about whether all the endpoints really return an array of objects with key type and string value? Is sourceDocuments omitted from the other endpoints? Only from the examples? Omitted when empty?

The examples are different because in the first example I deliberately include only the minimal attributes of the services and terms objects, name for services and type for terms respectively, but we could add more attributes if it seems relevant. In the last example, I added sourceDocuments to the terms object to give an example of the type of terms attributes that could be added to the response if necessary.

In the end, if we actually only keep type for terms in the response, it could be:

```
GET /services/service1
```

```json
{
  "id": "service1",
  "name": "Service 1",
  "terms": ["Terms of Service", "Privacy Policy"]
}
```

Here is an example of a response with all the attributes, to show what is available.

```
GET /services/service1
```

```json
{
  "id": "service1",
  "name": "Service 1",
  "terms": [
    {
      "type": "Terms of Service",
      "sourceDocuments": [
        {
          "location": "https://service1.com/tos-1",
          "executeClientScripts": false,
          "contentSelectors": "#main",
          "insignificantContentSelectors": ".returnToTop",
          "filters": ["cleanUrls"]
        },
        {
          "location": "https://service1.com/tos-2",
          "executeClientScripts": false,
          "contentSelectors": "#main",
          "insignificantContentSelectors": ".returnToTop",
          "filters": ["cleanUrls"]
        }
      ]
    },
    {
      "type": "Privacy Policy",
      "sourceDocuments": [
        {
          "location": "https://service1.com/privacy-policy",
          "executeClientScripts": true,
          "contentSelectors": "body",
          "insignificantContentSelectors": ".returnToTop",
          "filters": ["cleanUrls"]
        }
      ]
    }
  ],
  "filters": "function cleanUrls(document) {…}"
}
```
Ndpnt commented 1 year ago

To be specific: the deadline is 24/04, end of day AoE (Anywhere on Earth).

MattiSG commented 1 year ago

Thank you very much @Ndpnt for opening this RFC and for this first clear proposal!

Feedback on proposition A

Base URL

I believe we should allow mounting the route under any arbitrary path, since several collections might be made available on a single host.

- <collection host>/api/:version
+ <collection host>[/optional/path]/api/:version

GET /services

GET /services/:serviceId

GET /services/:serviceId/terms/:termsType

  | Parameter | Type   | Description            |
  | --------- | ------ | ---------------------- |
  | serviceId        | string | The ID of the service. |
- | termsType        | string | The terms type. |
+ | termsType        | URL-encoded string | The terms type. |

Both for GET /services/:serviceId/terms and for GET /services/:serviceId/terms/:termsType, I second @Kissaki: I fail to see the added value for consumers, and see how the maintenance, testing and complexity would increase.

Unless we have a very clear way to do “fuzzy search”, I would also not be shocked if the API left it to the consumers to implement search, and only enabled two things:

  1. /services, for enumerating IDs and names.
  2. /service/:serviceId, for getting the full data of a given service.

I thus offer an alternative proposition below.

MattiSG commented 1 year ago

Proposition B

Base URL

<collection host>[/path]/api/:version

Endpoints

GET /services

Enumerate all services.

#### Returns

A JSON array of all services.

#### Example

```
GET /services
```

```json
[
  {
    "id": "service-1",
    "name": "Service/1"
  },
  {
    "id": "service-2",
    "name": "Service/2"
  }
]
```

GET /service/:serviceId

Retrieve the declaration of a specific service through its ID.

#### Parameters

| Parameter | Type               | Description            |
| --------- | ------------------ | ---------------------- |
| serviceId | URL-encoded string | The ID of the service. |

#### Returns

The full JSON declaration of the service with the given ID. Returns an HTTP `404` if no matching service is found.

#### Example

```
GET /service/service-1
```

```json
{
  "id": "service-1",
  "name": "Service/1",
  "terms": [
    {
      "type": "Terms of Service",
      "sourceDocuments": [
        {
          "location": "https://service1.com/tos-1",
          "executeClientScripts": false,
          "contentSelectors": "#main",
          "insignificantContentSelectors": ".returnToTop",
          "filters": ["cleanUrls"]
        },
        {
          "location": "https://service1.com/tos-2",
          "executeClientScripts": false,
          "contentSelectors": "#main",
          "insignificantContentSelectors": ".returnToTop",
          "filters": ["cleanUrls"]
        }
      ]
    },
    {
      "type": "Privacy Policy",
      "sourceDocuments": [
        {
          "location": "https://service1.com/privacy-policy",
          "executeClientScripts": true,
          "contentSelectors": "body",
          "insignificantContentSelectors": ".returnToTop",
          "filters": ["cleanUrls"]
        }
      ]
    }
  ],
  "filters": "function cleanUrls(document) {…}"
}
```
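
To illustrate how a consumer might call the two endpoints of this proposition, here is a minimal Python sketch. `CollectionClient`, `http_get_json`, the host `example.org`, the `/demo` path and the `v1` version are all assumptions made for illustration; none of them is defined by this RFC. The HTTP function is injectable so the sketch runs offline against a stub.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def http_get_json(url):
    """Fetch `url` and decode its JSON body (urlopen raises HTTPError on a 404)."""
    with urlopen(url) as response:
        return json.load(response)


class CollectionClient:
    """Hypothetical consumer of the two Proposition B endpoints."""

    def __init__(self, base_url, get_json=http_get_json):
        # base_url is expected to look like "<collection host>[/path]/api/<version>".
        self.base_url = base_url.rstrip("/")
        self.get_json = get_json

    def list_services(self):
        """GET /services: enumerate the IDs and names of all services."""
        return self.get_json(f"{self.base_url}/services")

    def get_service(self, service_id):
        """GET /service/:serviceId, with the ID percent-encoded."""
        return self.get_json(f"{self.base_url}/service/{quote(service_id, safe='')}")


def fake_get_json(url):
    """Stand-in for a real collection so the sketch runs without network access."""
    responses = {
        "https://example.org/demo/api/v1/services": [
            {"id": "service-1", "name": "Service/1"},
        ],
        "https://example.org/demo/api/v1/service/service-1": {
            "id": "service-1",
            "name": "Service/1",
            "terms": [],
        },
    }
    return responses[url]


client = CollectionClient("https://example.org/demo/api/v1/", get_json=fake_get_json)
print(client.list_services())
print(client.get_service("service-1")["name"])
```
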
Ndpnt commented 1 year ago

Thank you @MattiSG for your relevant feedback 🙂.

Base URL

> I believe we should allow mounting the route under any arbitrary path, since several collections might be made available on a single host.

- <collection host>/api/:version
+ <collection host>[/optional/path]/api/:version

👍

GET /services

  • I understand we would like “fuzzy search”, but this makes it potentially very complicated as we'd need to specify how the “fuzziness” works. How do you view this fuzziness factor? 🙂

I was thinking of dealing with case and accents. For example, to make it easy to find out whether services like YouTube or GitHub exist in the collection even if the name in the request is not quite right: GET /services?name=Youtube or GET /services?name=github.
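
A case- and accent-insensitive comparison of this kind could be sketched as follows in Python. This is only one possible reading of "fuzziness", not a specification; `normalize_name` and `find_services` are hypothetical helper names, and the `id` values in the sample data are made up.

```python
import unicodedata


def normalize_name(name):
    """Fold case and strip combining accents so near-identical names compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.casefold()


def find_services(services, query):
    """Return the services whose normalized name equals the normalized query."""
    wanted = normalize_name(query)
    return [s for s in services if normalize_name(s["name"]) == wanted]


# Hypothetical sample data; the IDs are assumptions.
services = [
    {"id": "youtube", "name": "YouTube"},
    {"id": "github", "name": "GitHub"},
]

print(find_services(services, "Youtube"))  # [{'id': 'youtube', 'name': 'YouTube'}]
print(find_services(services, "github"))   # [{'id': 'github', 'name': 'GitHub'}]
```

The same normalization works whether it runs in the API or in the consumer, so it does not by itself decide where search should live.
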

  • Currently, the name constraints are very few. How would non-ASCII characters and URL-meaningful characters be handled?

I was thinking of handling them by percent-encoding them. For example: https://example.com/path/with/é → https://example.com/path/with/%C3%A9
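
As an illustration of that encoding on the consumer side, here is a short Python sketch using the standard library. `terms_path` is a hypothetical helper; passing `safe=""` is a choice made here so that `/` inside a name or ID is encoded too rather than read as a path separator.

```python
from urllib.parse import quote, unquote


def terms_path(service_id, terms_type):
    """Build the path of the terms endpoint with both segments percent-encoded."""
    return f"/services/{quote(service_id, safe='')}/terms/{quote(terms_type, safe='')}"


print(quote("é", safe=""))                        # %C3%A9
print(quote("Service/1", safe=""))                # Service%2F1
print(unquote("%C3%A9"))                          # é
print(terms_path("service1", "Privacy Policy"))   # /services/service1/terms/Privacy%20Policy
```
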

  • I prefer type over terms_type as a query term, since in this context it is probably clear enough.

👍

  • The result could end up being pretty big. What about adding pagination or, in a simpler way, simply returning a list of IDs and names, leaving it to the consumer to query the /services/:serviceId endpoint for additional information? 🙂

I'm in favor of simply returning a list of IDs and names.

GET /services/:serviceId

  • Currently, the serviceId constraints are based on filesystem constraints, not on URI constraints. How would non-ASCII characters and URL-meaningful characters be handled?

By encoding them.

GET /services/:serviceId/terms/:termsType

  | Parameter | Type   | Description            |
  | --------- | ------ | ---------------------- |
  | serviceId        | string | The ID of the service. |
- | termsType        | string | The terms type. |
+ | termsType        | URL-encoded string | The terms type. |

> Both for GET /services/:serviceId/terms and for GET /services/:serviceId/terms/:termsType, I second @Kissaki: I fail to see the added value for consumers, and see how the maintenance, testing and complexity would increase.

The idea was to return only the information for a specific terms type. I agree with both of you when only the minimal attributes of the terms are returned, but I think it can be valuable if the full terms object is returned.

Ndpnt commented 1 year ago

As a first stage, I think the simple proposal B is very good. I'm just a little concerned that external applications might consider a service undeclared because of a small case error in the service name.

But I really like the idea of keeping things simple, so I vote for this proposal and we'll see with our collaboration with ToS;DR if we run into problems 🙂

madoleary commented 1 year ago

Proposal B sounds good to me

Ndpnt commented 1 year ago

The deadline has expired, thank you all for your participation. Proposal B received the most approvals, so it is the one that will be implemented. 🙂