bear / indie-stats

Indieweb site crawler and MF2 data collection tool
MIT License

index raw data for use by api #8

Open snarfed opened 8 years ago

snarfed commented 8 years ago

i often have arbitrary questions that i'd love to use this data to answer, e.g. how many sites have an h-card?, or how many people use PSCs/PSLs? any chance you might serve the data publicly? maybe a zip file per day and/or per site?

bear commented 8 years ago

right now I am collecting everything as discrete json files:

https://indie-stats.com/domains/{{ basedomain }}/processed.json contains a list of all poll results

https://indie-stats.com/domains/{{ basedomain }}/{{ timestamp }}_{{ basedomain }}.json is a single poll result

for example:

processed.json:

["20150311T082741_tantek.com.json", "20150224T083010_tantek.com.json"]

20150311T082741_tantek.com.json:

{
  "status": 200,
  "headers": {
    "content-length": "26734",
    "x-powered-by": "PHP/5.3.28",
    "content-encoding": "gzip",
    "vary": "Accept-Encoding, User-Agent",
    "server": "LiteSpeed",
    "connection": "close",
    "date": "Wed, 11 Mar 2015 08:27:41 GMT",
    "content-type": "text/html; charset=UTF-8"
  },
  "domain": "tantek.com",
  "html": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n</head></html>",
  "excluded": false,
  "url": "http://Tantek.com/",
  "polled": "2015-03-11T08:27:41Z",
  "claimed": false,
  "mf2": {},
  "history": [ 200, 200 ]
}

where obviously the mf2 and html entries have a lot more data
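
As a rough sketch of how a client might walk the layout described above: the snippet below builds the two URL shapes and picks the newest poll for a domain. The helper names (`processed_url`, `poll_url`, `fetch_json`, `latest_poll`) are made up for illustration, and it assumes the URLs are still served as described and that `processed.json` is a plain list of filenames, as in the example.

```python
import json
from urllib.request import urlopen

# Base URL taken from the comment above; it may no longer be live.
BASE = "https://indie-stats.com/domains"


def processed_url(domain):
    """URL of the list of poll result filenames for one domain."""
    return f"{BASE}/{domain}/processed.json"


def poll_url(domain, filename):
    """URL of a single poll result, e.g. 20150311T082741_tantek.com.json."""
    return f"{BASE}/{domain}/{filename}"


def fetch_json(url):
    """Hypothetical helper: GET a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def latest_poll(domain, fetch=fetch_json):
    """Return the newest poll result for a domain, or None if there are none.

    Filenames start with a YYYYMMDDTHHMMSS timestamp, so the
    lexicographically largest name is also the most recent poll.
    """
    filenames = fetch(processed_url(domain))
    if not filenames:
        return None
    return fetch(poll_url(domain, max(filenames)))
```

The `fetch` parameter is there so the logic can be tried against local files or a stub instead of the live site.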

bear commented 8 years ago

you can now get the above using the beginnings of an api

first make the /api/v1/domains/ call to get the domain's data. within that json you will find the list of processed items stored in the "history" key, and each of those json files can be pulled by direct request

will add a tarball to the processing code
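
Once poll results are in hand, a question like "how many sites have an h-card?" could be answered along these lines. This is only a sketch: it assumes the "mf2" value follows the usual microformats2 parsed JSON shape (a top-level "items" list whose entries carry "type", "properties" and optionally "children"), and the function names are hypothetical.

```python
def has_type(item, wanted):
    """True if this mf2 item, or anything nested inside it, has the given type."""
    if wanted in item.get("type", []):
        return True
    # Nested microformats can appear as property values or under "children".
    nested = list(item.get("children", []))
    for values in item.get("properties", {}).values():
        nested.extend(v for v in values if isinstance(v, dict))
    return any(has_type(child, wanted) for child in nested)


def count_domains_with(polls, wanted="h-card"):
    """Count distinct domains whose poll results contain the wanted mf2 type."""
    found = set()
    for poll in polls:
        items = poll.get("mf2", {}).get("items", [])
        if any(has_type(item, wanted) for item in items):
            found.add(poll["domain"])
    return len(found)
```

Counting by distinct "domain" means several polls of the same site are only counted once.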

snarfed commented 8 years ago

cool! thank you!

snarfed commented 8 years ago

to expand on the initial description, here are concrete questions i'd love to be able to ask:

original discussion in IRC.

bear commented 8 years ago

cool - i'll start chewing on this and ping you when I have the start of it