andersju / webbkoll

An online tool that checks how a website is doing with regards to privacy

#14

Closed: ghost closed this issue 3 years ago

andersju commented 5 years ago

Hi! So far we've deliberately not provided an API due to limited server resources, but it would be a good idea to add support for that in the code so that it'd be a possibility for anyone who wants to run their own instance. It's on my TODO! (Along with Dockerizing to make it trivial to get the whole thing going)

You can, by the way, already run the "backend" part easily: https://github.com/andersju/webbkoll-backend -- however, this only gives you a JSON with the "raw" stuff (headers, cookies, etc.), without any analysis. Might be a good idea to move the analysis part to JS, actually. Maybe a small library which could then be used by both the server and perhaps a future CLI tool. Hmm.

andersju commented 5 years ago

For sure. I just imagine that if we had an API on Webbkoll, people might use it for things like "check a list of 1000 websites" (it's what I would do! :)) and other kinds of automated checking. That would be great, but I think it could get out of hand given our current resources. But I'll think of something.

andersju commented 3 years ago

I just added a very basic experimental API, currently only available when running in dev mode.

Start a new scan:

curl -X POST 'http://localhost:4000/api/check/?url=http://example.com'

You'll get a reply with something like:

{
  "data": null,
  "id": "8963f1b0-8db9-44d2-85c3-95303ad708c4",
  "input_url": "http://example.com",
  "inserted_at": 1605484699780151,
  "status": "queue",
  "status_message": null,
  "try_count": 0,
  "updated_at": null
}

status is one of "queue", "processing", "failed" or "done". You can take the returned id and GET that particular scan:

curl 'http://localhost:4000/api/check/?id=8963f1b0-8db9-44d2-85c3-95303ad708c4'
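
For scripting, you can poll the scan by id until it reaches a terminal state. A minimal sketch in shell, assuming jq is available for JSON parsing (jq is just a convenience here, not something the API requires):

# Start a scan and capture the returned id
ID=$(curl -s -X POST 'http://localhost:4000/api/check/?url=http://example.com' | jq -r '.id')

# Poll every couple of seconds until the scan is no longer queued/processing
while true; do
  STATUS=$(curl -s "http://localhost:4000/api/check/?id=$ID" | jq -r '.status')
  case "$STATUS" in
    done|failed) break ;;
  esac
  sleep 2
done

echo "Scan finished with status: $STATUS"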

If status is "failed", you'll have the error message in status_message. If it's "done", you'll see a new key, data, with the results:

{
  "data": {
    "cookie_count": {
      "first_party": 0,
      "third_party": 0
    },
    "cookie_domains": 0,
    "cookies": {
      "first_party": [],
      "third_party": []
    },
    "csp": {

etc. Sending a GET request like this:

curl 'http://localhost:4000/api/check/?url=http://example.com'

...will give you the latest result for that URL, if it has been checked already. If it hasn't, you'll get an error.
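
Once a scan is done, individual fields can be pulled straight out of the data object. For example, to print the cookie counts shown above (again assuming jq):

curl -s 'http://localhost:4000/api/check/?url=http://example.com' | jq '.data.cookie_count'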

Sending a POST request for a site that's already been checked will also give you the latest result for that URL, unless you add &refresh=on to the query string, which always starts a new scan:

curl -X POST 'http://localhost:4000/api/check/?url=http://example.com&refresh=on'

Keep in mind: this is experimental and the format/structure of the result is messy, subject to change at any moment, and wasn't designed with an API in mind.

I'm planning to rewrite the JS "backend", which currently does little more than run Chromium and return some raw results, so that it also does all the analysis, perhaps putting much of that into a JS library (JS being the most pragmatic choice given our dependency on Puppeteer). This would also make it easy to build e.g. a standalone CLI version that people can run locally.