Podcastindex-org / database


Efficient ways to sync with the Podcast Index database #33

Open ryan-lp opened 6 months ago

ryan-lp commented 6 months ago

In https://github.com/Podcastindex-org/podcast-namespace/discussions/558 I wrote:

As for what I'm currently doing (downloading the whole database every week, computing a diff, then integrating that), I feel this could be streamlined. [...]

@daveajones replied:

There are a bunch of ways to stay up to date actually. I’m glad to share them. [...] This should probably be in the “database” repo instead of in the namespace repo.

I'm moving that discussion here and would be interested in hearing about the ways you mentioned. I'm personally interested in ways that don't hit the API server, in part due to https://github.com/Podcastindex-org/legal/issues/1, which prohibits building databases out of content returned from the API. Ideally, I think we want an efficient and permissible way to create mirror databases, both to improve locality and to prevent a single point of failure.

ryan-lp commented 6 months ago

To keep the mirror up to date, it would be helpful to have a diff indicating insertions, deletions, and updates. I'm not talking about updates to the feed contents, but updates to the feed identity (feed URL, iTunes ID, ...). This sort of mirroring has some parallels with the way mirrors are created for Linux distributions, using rsync to transfer only what has changed. In practice, though, a Podcast Index mirror DB might be either an exact replica or a custom DB with extra columns. As long as it has the same primary key, the diff approach will still work. Since there are straightforward instructions on how to create a Linux distribution mirror, there are many Linux mirrors and no single point of failure. My Arch Linux mirrorlist file has 500 alternative mirror sites in it.
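To make the diff idea concrete, here is a minimal sketch in Python. It assumes each snapshot has already been loaded into a dict keyed by the primary key, and that the feed identity is just a feed URL plus an optional iTunes ID. The names `FeedIdentity`, `SnapshotDiff`, `diff_snapshots` and `apply_diff` are hypothetical and not part of any Podcast Index tooling.

```python
from typing import Dict, NamedTuple, Optional, Set


class FeedIdentity(NamedTuple):
    """Identity columns only (assumed shape), not the feed contents."""
    url: str
    itunes_id: Optional[int]


class SnapshotDiff(NamedTuple):
    inserted: Dict[int, FeedIdentity]  # keys present only in the new snapshot
    deleted: Set[int]                  # keys present only in the old snapshot
    updated: Dict[int, FeedIdentity]   # keys in both, with a changed identity


def diff_snapshots(old: Dict[int, FeedIdentity],
                   new: Dict[int, FeedIdentity]) -> SnapshotDiff:
    """Compare two snapshots keyed by the same primary key."""
    inserted = {k: v for k, v in new.items() if k not in old}
    deleted = {k for k in old if k not in new}
    updated = {k: v for k, v in new.items() if k in old and old[k] != v}
    return SnapshotDiff(inserted, deleted, updated)


def apply_diff(mirror: Dict[int, FeedIdentity], diff: SnapshotDiff) -> None:
    """Apply a diff in place. A custom mirror with extra columns would
    update only the identity columns of each affected row instead."""
    for key in diff.deleted:
        mirror.pop(key, None)
    mirror.update(diff.inserted)
    mirror.update(diff.updated)
```

If a diff like this were published alongside each database dump, a mirror would only need to download and apply the diff rather than the whole database.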

In theory, the podping network could also be used to broadcast insertions, at least. The Podcast Index might then publish guidelines on how mirror operators can detect deletions and updates (i.e. to the identity) on their own. Although this approach might need to involve adding the iTunes ID to the podping message. Ideally it would be in the feed content anyway, but that is unlikely to be a realistic option in the near to medium term.

ryan-lp commented 6 months ago

Although this approach might need to involve adding the iTunes ID to the podping message.

I suppose an alternative would be to leave the podping message format the way it is, just broadcasting the feed URLs, and then rely on the iTunes API to look up the iTunes ID whenever a new podcast appears. There's no official API to look up an iTunes ID by feed URL, but you can search by title to get a set of results, then iterate over those results to match the feed URL.

GET https://itunes.apple.com/search?term=PODCAST_TITLE&attribute=titleTerm&entity=podcast

There is a limit of 20 API calls per minute, so this assumes new podcasts are created at a rate no greater than 20 per minute.
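A rough sketch of that lookup in Python, using only the standard library. It assumes the search response is JSON with a `results` list whose entries carry `feedUrl` and `collectionId` fields; those field names, and the helper names `build_search_url`, `find_itunes_id` and `lookup_itunes_id`, are assumptions for illustration and should be checked against a real response.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable, Optional

SEARCH_URL = "https://itunes.apple.com/search"


def build_search_url(title: str) -> str:
    """Build the title search request shown above."""
    query = urllib.parse.urlencode(
        {"term": title, "attribute": "titleTerm", "entity": "podcast"}
    )
    return f"{SEARCH_URL}?{query}"


def find_itunes_id(results: Iterable[dict], feed_url: str) -> Optional[int]:
    """Return the ID of the first result whose feed URL matches exactly.

    Assumes each result has "feedUrl" and "collectionId" keys.
    """
    for result in results:
        if result.get("feedUrl") == feed_url:
            return result.get("collectionId")
    return None


def lookup_itunes_id(title: str, feed_url: str) -> Optional[int]:
    """Search by title, then match on feed URL (performs a network call)."""
    with urllib.request.urlopen(build_search_url(title), timeout=10) as resp:
        payload = json.load(resp)
    return find_itunes_id(payload.get("results", []), feed_url)
```

To stay within 20 calls per minute, a caller would need to space calls to `lookup_itunes_id` at least 3 seconds apart. The exact string comparison on the feed URL is a simplification; a real mirror might want to normalise URLs (scheme, trailing slash) before comparing.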