Podcastindex-org / database


Detecting duplicates via atom:link self #39

Open ryan-lp opened 8 months ago

ryan-lp commented 8 months ago

The following two podcasts in the DB are duplicates:

id|itunesId|url
616|1002532108|http://feeds.feedburner.com/PostScriptPodcast
2579183||https://feeds.soundcloud.com/users/soundcloud:users:156648456/sounds.rss

The first one has an itunesId, the second one doesn't. The first one also has this XML inside the feed:

        <atom:link href="https://feeds.soundcloud.com/users/soundcloud:users:156648456/sounds.rss" rel="self" type="application/rss+xml"/>

The href of that self link is exactly the URL of the second row, so it can be used to link the two together. Since only one of them has an itunesId, there is no conflict and they can be collapsed into one row.
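A minimal sketch of how the self link could be pulled out of a feed, assuming Python's standard `xml.etree.ElementTree`. The `self_link` helper and the trimmed `SAMPLE` document are hypothetical stand-ins for illustration, not code or data from this repo; only the `atom:link` element is copied from the feed above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def self_link(feed_xml):
    """Return the href of the channel-level <atom:link rel="self">, or None."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        return None
    for link in channel.findall(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "self":
            return link.get("href")
    return None

# Hand-written stand-in for the feed stored as row 616 (everything but
# the atom:link element is an assumption made for this example).
SAMPLE = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="https://feeds.soundcloud.com/users/soundcloud:users:156648456/sounds.rss" rel="self" type="application/rss+xml"/>
  </channel>
</rss>"""

print(self_link(SAMPLE))
```

If the returned href differs from the URL the row was fetched from, the row is a candidate duplicate of whichever row is stored under that href.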

ryan-lp commented 8 months ago

Digging deeper, the listed iTunes ID comes back as invalid when I query the iTunes API, so it's stale. Neither feed actually contains any episodes either, so maybe both could be deleted.

Even so, checking each feed's atom:link rel=self against the URLs of other rows could still be used to detect some duplicates.
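Roughly what I have in mind, as a sketch rather than a proposal for the actual schema. It assumes the self href has already been extracted per row, and the `normalize` and `find_self_link_duplicates` names are made up for this example. Ignoring the scheme and trailing slashes when comparing URLs is my own assumption and may be too loose or too strict in practice.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Comparison key: lowercased host, path without trailing slash, query.
    The scheme is ignored (assumption: http/https variants are the same feed)."""
    parts = urlsplit(url.strip())
    return (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)

def find_self_link_duplicates(rows):
    """rows: iterable of (id, url, self_href or None).
    Returns (id, other_id) pairs where id's self link points at other_id's URL."""
    rows = list(rows)
    by_url = {normalize(url): feed_id for feed_id, url, _ in rows}
    pairs = []
    for feed_id, _, self_href in rows:
        if not self_href:
            continue
        target = by_url.get(normalize(self_href))
        if target is not None and target != feed_id:
            pairs.append((feed_id, target))
    return pairs

rows = [
    (616, "http://feeds.feedburner.com/PostScriptPodcast",
     "https://feeds.soundcloud.com/users/soundcloud:users:156648456/sounds.rss"),
    # The self link of this feed is not stated in this thread; None is a placeholder.
    (2579183, "https://feeds.soundcloud.com/users/soundcloud:users:156648456/sounds.rss",
     None),
]

print(find_self_link_duplicates(rows))  # [(616, 2579183)]
```

A feed whose self link points back at its own URL is skipped, so only cross-row matches are reported. What to do with a matched pair (merge, delete, or flag for review) is left open here.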