OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2

Obtain `location` dynamically from link #1039

Open jetlime opened 6 months ago

jetlime commented 6 months ago

On some sites, such as the LinkedIn transparency reports, the terms of interest are served from dynamically named endpoints, for example ones determined by date (e.g. October-2023-LinkedIn-DSA-Transparency-Report10.pdf). In most cases, the links to these dynamic endpoints are published at fixed locations. It thus makes sense to introduce a new declaration term, dynamic-fetch.

This term would fetch the document located at the dynamic endpoint, whose URL is obtained by evaluating dynamic-fetch.variable on the page at dynamic-fetch.location. It would be complementary to fetch.

It could potentially be defined as follows:

{
  "name": "Linkedin",
  "documents": {
    "Transparency Ad Library": {
      // This shall fetch the pdf doc at https://content.linkedin.com/content/dam/help/linkedin/en-us/October-2023-LinkedIn-DSA-Transparency-Report10.pdf
      "dynamic-fetch": {
        "variable": "div[class=\"t-14 article-content__rich-text hue-default-color\"] > ul > li:first-child > a.getAttribute('href')",
        "location": "https://www.linkedin.com/help/linkedin/answer/a1678508?hcppcid=search"
      }
    }
  }
}
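
To make the proposal concrete, here is a minimal sketch of the resolution step: load the page at dynamic-fetch.location, pick the link, and turn its href into an absolute URL. This is not the engine's implementation. The names `FirstListLinkParser` and `resolve_dynamic_location` are hypothetical, the sample HTML is invented for illustration, and the hard-coded "first link in the first list item" rule is a simplistic stand-in for a real CSS selector engine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstListLinkParser(HTMLParser):
    """Records the href of the first <a> found in the first <li> of a <ul>.

    Hypothetical helper: a stand-in for evaluating the CSS selector from
    the declaration. It does not handle nested lists.
    """

    def __init__(self):
        super().__init__()
        self.items_seen = 0
        self.in_first_item = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.items_seen = 0
        elif tag == "li":
            self.items_seen += 1
            self.in_first_item = self.items_seen == 1
        elif tag == "a" and self.in_first_item and self.href is None:
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_first_item = False


def resolve_dynamic_location(page_html, page_url):
    """Return the absolute URL of the document linked from the page."""
    parser = FirstListLinkParser()
    parser.feed(page_html)
    if parser.href is None:
        # The page structure changed: the declaration needs maintenance.
        raise ValueError("no link found; the selector may be outdated")
    # urljoin keeps absolute hrefs as-is and resolves relative ones.
    return urljoin(page_url, parser.href)


# Invented sample markup, only loosely modelled on the declaration above.
SAMPLE_HTML = """
<div class="article-content__rich-text">
  <ul>
    <li><a href="https://content.linkedin.com/content/dam/help/linkedin/en-us/October-2023-LinkedIn-DSA-Transparency-Report10.pdf">Latest report</a></li>
    <li><a href="/example/older-report.pdf">Older report</a></li>
  </ul>
</div>
"""

LOCATION = "https://www.linkedin.com/help/linkedin/answer/a1678508?hcppcid=search"

if __name__ == "__main__":
    print(resolve_dynamic_location(SAMPLE_HTML, LOCATION))
```

The resolved URL would then be handed to the existing fetch logic. Raising an error when no link matches makes a broken selector surface as a maintenance task rather than a silent miss.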

As I am pretty new to this tool, I would be happy to hear some feedback about this proposition! If you share my vision, I would be happy to implement it :)

MattiSG commented 3 months ago

Thanks @jetlime for this suggestion!

Indeed, it sometimes happens that terms are only available as a downloadable file behind a link. The idea of obtaining the URL dynamically from the DOM is a smart answer to that problem 👍

The main question we need to answer to decide whether it would be worth adding a new type of fetch is: are the location and DOM from which we obtain the link any more stable than the link itself? In the case at hand, DSA Transparency Reports are published every 6 months. We'd need to demonstrate that the location and DOM from which the link can be obtained change significantly less often than twice a year. Otherwise, the maintenance burden on collection maintainers will be the same, and we would have increased software complexity for nothing 😰

The next investigation steps I see are:

  1. Identify at least 2 other cases where such a system would be used.
  2. Measure with the Wayback Machine (or any other reliable history tool) how often the location or link selector changed (l) vs how often the target of the link changed (t) in at least the last 2 years.

If t > e ⨉ l, where e is some arbitrary multiplier encoding the effort it would take to implement this feature, we'll consider it 🙂
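
As a rough illustration of this decision rule, the sketch below counts changes between consecutive snapshots and checks whether t > e × l. The function names, the snapshot values, and the multipliers are all hypothetical; real input would come from the Wayback Machine measurements described in step 2.

```python
def count_changes(observations):
    """Count how many times consecutive observations differ."""
    return sum(
        1 for previous, current in zip(observations, observations[1:])
        if previous != current
    )


def worth_implementing(target_history, selector_history, effort_multiplier):
    """Apply the rule t > e * l from the comment above.

    target_history: link targets seen in successive snapshots (gives t).
    selector_history: (location, selector) pairs that worked in the same
        snapshots (gives l).
    effort_multiplier: the arbitrary multiplier e.
    """
    t = count_changes(target_history)
    l = count_changes(selector_history)
    return t > effort_multiplier * l


# Hypothetical snapshots taken every 6 months over 2 years.
targets = ["report-a.pdf", "report-b.pdf", "report-c.pdf", "report-d.pdf", "report-e.pdf"]
selectors = [
    ("page-1", "selector-1"),
    ("page-1", "selector-1"),
    ("page-1", "selector-1"),
    ("page-1", "selector-2"),
    ("page-1", "selector-2"),
]

if __name__ == "__main__":
    # Here t = 4 and l = 1, so the outcome depends on the chosen e.
    print(worth_implementing(targets, selectors, effort_multiplier=2))
    print(worth_implementing(targets, selectors, effort_multiplier=4))
```

Because the inequality is strict, a case where t equals e × l does not qualify.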