dogsheep / hacker-news-to-sqlite

Create a SQLite database containing data pulled from Hacker News
Apache License 2.0

Use HN algolia endpoint to retrieve trees #3

Open simonw opened 3 years ago

simonw commented 3 years ago

The trees command currently has to make a request for every single comment. Algolia have an endpoint that bundles the entire thread together into a single request.

https://hn.algolia.com/api/v1/items/ID

Here's an example that loads quickly, with about 50 comments: https://hn.algolia.com/api/v1/items/27941108

It doesn't appear to use pagination at all: if a thread is big, the whole thing comes back in one big response.

I ran this search to find some stories with more than 1000 comments: https://hn.algolia.com/api/v1/search?tags=story&numericFilters=num_comments%3E=1000

Here's one: https://news.ycombinator.com/item?id=25015967 with 4759 comments. Hitting the API takes 41s and returns 3.7 MB of JSON!

wget 'https://hn.algolia.com/api/v1/items/25015967'  0.03s user 0.04s system 0% cpu 41.368 total
/tmp % ls -lah 25015967 
-rw-r--r--  1 simon  wheel   3.7M Jul 24 20:31 25015967
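
Since a large thread can take tens of seconds to come back, anything calling this endpoint probably wants a generous timeout. Here's a minimal Python sketch using only the standard library. The names `item_url` and `fetch_tree` and the 120 second default are illustrative assumptions, not code from hacker-news-to-sqlite:

```python
import json
import urllib.request

# URL pattern taken from the endpoint mentioned above.
ALGOLIA_ITEM_URL = "https://hn.algolia.com/api/v1/items/{}"


def item_url(item_id):
    """Build the Algolia items URL for a story or comment ID."""
    return ALGOLIA_ITEM_URL.format(int(item_id))


def fetch_tree(item_id, timeout=120):
    """Fetch an item and its nested children in a single request.

    The timeout default is a guess: the 4759-comment thread above
    took about 41 seconds, so the usual short timeouts could fail.
    """
    with urllib.request.urlopen(item_url(item_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_tree(27941108)` should return the roughly 50 comment example as a nested dictionary, assuming the endpoint still behaves as described here.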

simonw commented 3 years ago

Prototype:

curl 'https://hn.algolia.com/api/v1/items/27941108' \
  | jq '[recurse(.children[]) | del(.children)]' \
  | sqlite-utils insert hn.db items - --pk id
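
A rough Python equivalent of that pipeline, in case it helps when porting this into the trees command. It walks the nested response in the same parent-first order as jq's `recurse`, drops the `children` key from each item, and writes the rows with the standard library `sqlite3` module. The function names and the fixed column list (taken from the keys the endpoint returns for a comment) are assumptions for illustration, not how the tool does it today:

```python
import sqlite3

# Columns assumed from the keys seen in the Algolia item JSON.
COLUMNS = [
    "id", "created_at", "created_at_i", "type", "author", "title",
    "url", "text", "points", "parent_id", "story_id",
]


def flatten(item):
    """Yield the item and every descendant, each without 'children'."""
    stack = [item]
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        yield {key: value for key, value in node.items() if key != "children"}
        # Reverse so children come off the stack in their original order.
        stack.extend(reversed(children))


def insert_items(conn, tree, table="items"):
    """Insert a flattened tree into a table keyed on id; return the row count."""
    column_sql = ", ".join('"{}"'.format(c) for c in COLUMNS)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS "{}" ({}, PRIMARY KEY ("id"))'.format(
            table, column_sql
        )
    )
    placeholders = ", ".join("?" for _ in COLUMNS)
    rows = [tuple(row.get(c) for c in COLUMNS) for row in flatten(tree)]
    conn.executemany(
        'INSERT OR REPLACE INTO "{}" ({}) VALUES ({})'.format(
            table, column_sql, placeholders
        ),
        rows,
    )
    conn.commit()
    return len(rows)
```

Using an explicit stack instead of recursion avoids hitting Python's recursion limit on very deeply nested threads.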

simonw commented 3 years ago

If you hit the endpoint for a comment that's part of a thread you get that comment and its recursive children: https://hn.algolia.com/api/v1/items/27941552

You can tell that it's not the top-level item because its parent_id isn't null. You can use story_id to figure out what the top-level item is.

{
  "id": 27941552,
  "created_at": "2021-07-24T15:08:39.000Z",
  "created_at_i": 1627139319,
  "type": "comment",
  "author": "nine_k",
  "title": null,
  "url": null,
  "text": "<p>I wish ...",
  "points": null,
  "parent_id": 27941108,
  "story_id": 27941108
}
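
That check is easy to sketch in Python. The helper names here are made up for illustration, and the code assumes that story_id is always populated on comments, which is what this one example shows but isn't something I've verified across the API:

```python
def is_top_level(item):
    """An item with a null parent_id is the root of its thread."""
    return item.get("parent_id") is None


def top_level_id(item):
    """Return the ID of the story that the item belongs to.

    Assumes comments always carry a story_id, as in the example above.
    """
    if is_top_level(item):
        return item["id"]
    return item["story_id"]


comment = {"id": 27941552, "parent_id": 27941108, "story_id": 27941108}
print(is_top_level(comment))   # False
print(top_level_id(comment))   # 27941108
```

With the top-level ID in hand, a single request for that item would pull down the whole thread rather than just one branch.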

simonw commented 3 years ago

Got a TIL out of this: https://til.simonwillison.net/jq/extracting-objects-recursively