Podcastindex-org / docs-api

Developer documentation for the podcastindex.org api.
https://podcastindex-org.github.io/docs-api/
MIT License

Duplicate entries #17

Closed. ByteHamster closed this issue 4 years ago.

ByteHamster commented 4 years ago

A big problem of public podcast directories is that they can quickly fill up with duplicates. This can be observed in the wild on gpodder.net (GitHub), where most search terms return a pretty big number of old or broken feeds, as well as unofficial mirrors.

Does Podcastindex.org have a strategy on how to deal with duplicate feed submissions?

adamc199 commented 4 years ago

I'll leave the technical explanations to Dave, but we have 10 years of RSS aggregation experience and have written many exceptions, etc., to combat this.

ghost commented 4 years ago

I don't see duplicates being a really big issue. They just have to be managed. There are several ways to do it, and of course you're never going to get them all.

Regarding Broken Feeds: I'm not sure if Dave shared this with everyone, but I believe broken feeds will be removed after a specific number of tries over a period of time.

Regarding Duplicates: The goal of the whole project isn't to become a gatekeeper of the podcasting world - in fact it's just the opposite. Duplicate feeds can arise from many factors. For example, when iTunes cracked down on keyword stuffing, many podcasts got removed and then were re-added. Some directories didn't remove the deleted feeds, so that created a large number of duplicates.

Then you're going to get intentional duplicates, where the entire feed is pretty much the same but the feed URL is different. These are pretty easy to identify based on the episode titles, release dates, and episode lengths. The easiest way to detect them would be to set the feed URL as a unique key. That's not a perfect solution, and of course people can add query strings to circumvent the system.

Then there are what I call partial duplicates. I learned this weekend that there are just about 96,000 duplicated episodes all related to Dungeons and Dragons. I suspect these are fans who create custom feeds based on favorite game action. They don't duplicate the whole feed, but just select episodes. Should they be removed? I think not. But to answer your question, yes, duplicate and dead feeds will be addressed.
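To make the two checks above concrete, here is a minimal sketch of what they might look like. This is not the actual index implementation; the helper names, the 0.9 overlap threshold, and the episode-record shape are all assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_feed_url(url: str) -> str:
    """Canonicalize a feed URL for use as a unique key.

    Dropping the query string and fragment defeats the simplest trick of
    appending "?x=1" to resubmit the same feed under a "different" URL.
    """
    parts = urlsplit(url.strip().lower())
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

def looks_like_clone(feed_a: list[dict], feed_b: list[dict]) -> bool:
    """Flag two feeds whose episodes largely match on title, release date,
    and duration -- the signals mentioned above."""
    key = lambda ep: (ep["title"].strip().lower(), ep["pub_date"], ep["duration"])
    a = {key(ep) for ep in feed_a}
    b = {key(ep) for ep in feed_b}
    if not a or not b:
        return False
    overlap = len(a & b) / min(len(a), len(b))
    return overlap > 0.9  # threshold is an arbitrary illustration
```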

daveajones commented 4 years ago

Thanks for addressing this Mike. Correct on all fronts.

Broken feeds: Each feed has an "errors" counter that is incremented each time the feed is pulled (downloaded) and the pull fails, with the increment based on the severity of the error. There is also a "parse_errors" counter that serves the same purpose on the parser side of things. The worse the error, the faster the counter goes up: most errors just increment by 1, ENOTFOUND and ECONNREFUSED increment by 10, 4xx HTTP statuses increment by 4, and 5xx HTTP statuses increment by 5. When the puller error count tops 100, the feed gets marked as "dead", the aggregators stop pulling it regularly, and it gets relegated to a single "error" aggregator that just gives best effort. If the best-effort aggregator ever brings a feed back from the dead, all the counters are reset to zero.
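For readers who want to see the weighting scheme laid out, here is a small sketch of the logic as described above. The field and function names are made up for illustration and are not the index's actual code.

```python
DEAD_THRESHOLD = 100

def error_increment(status=None, errno_name=None) -> int:
    """Return how much a single failed pull adds to the feed's error count."""
    if errno_name in ("ENOTFOUND", "ECONNREFUSED"):
        return 10
    if status is not None and 400 <= status < 500:
        return 4
    if status is not None and 500 <= status < 600:
        return 5
    return 1  # everything else just bumps the counter by one

def record_pull_error(feed: dict, status=None, errno_name=None):
    feed["errors"] += error_increment(status, errno_name)
    if feed["errors"] > DEAD_THRESHOLD:
        feed["dead"] = True  # handed off to the single best-effort "error" aggregator

def revive(feed: dict):
    """If the best-effort aggregator gets a good pull, reset everything."""
    feed["errors"] = 0
    feed["parse_errors"] = 0
    feed["dead"] = False
```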

Duplicates: I haven't worried too much about this so far. As Mike says, they should be fairly easy to spot by just doing comparisons. We'll have a script at some point that will sweep across and check for the obvious ones. I'm about to create a new API endpoint listing recent feeds added to the index. That'll be a good firehose for checking shenanigans too. I'm open to any and all bright ideas on this front.
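As a rough idea of how that firehose could be consumed, here is a hedged sketch of polling a recent-feeds endpoint and flagging URLs already in the index. The endpoint path and response shape are assumptions (the endpoint was still being built at the time of this thread), and the auth header scheme shown here is the key/secret/timestamp SHA-1 scheme used elsewhere in this API's docs.

```python
import hashlib
import time
import requests

API_KEY, API_SECRET = "YOUR_KEY", "YOUR_SECRET"
BASE = "https://api.podcastindex.org/api/1.0"

def auth_headers() -> dict:
    now = str(int(time.time()))
    digest = hashlib.sha1((API_KEY + API_SECRET + now).encode()).hexdigest()
    return {"X-Auth-Key": API_KEY, "X-Auth-Date": now,
            "Authorization": digest, "User-Agent": "dupe-sweeper/0.1"}

def recent_feeds(max_items: int = 100) -> list:
    # Assumed endpoint name; adjust to whatever the published docs specify.
    r = requests.get(f"{BASE}/recent/feeds", params={"max": max_items},
                     headers=auth_headers(), timeout=30)
    r.raise_for_status()
    return r.json().get("feeds", [])

known_urls = set()  # e.g. loaded from the existing index, keyed on normalized URL
for feed in recent_feeds():
    if feed.get("url", "") in known_urls:
        print("possible duplicate submission:", feed["url"])
```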

daveajones commented 4 years ago

For more clarity on the duplicates issue, we require a "write"-enabled developer key (which must be approved) to submit new feeds to the index. Quite a few networks and platforms are auto-submitting their shows to us now, so the index is very clean. We aren't going to make this a free-for-all where someone can go rogue and just dump 5,000 clone feeds in there. Nobody wants that. I've just finished up a roll-back feature where every addition gets attributed to a key and can be rolled back in batches if necessary. We want things very clean and we're moving slowly to make sure that happens.
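For anyone curious how key attribution and batch roll-back fit together, here is a minimal sketch under an assumed relational schema. The table and column names are invented for illustration; the real index schema is not published in this thread.

```python
import sqlite3

db = sqlite3.connect("index.db")
db.execute("""CREATE TABLE IF NOT EXISTS newsfeeds (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,               -- normalized feed URL as the unique key
    submitted_by TEXT NOT NULL,    -- the developer key that added this feed
    submitted_at INTEGER NOT NULL  -- unix time of the submission
)""")

def add_feed(url: str, dev_key: str):
    """Record a submission, attributed to the developer key that made it."""
    db.execute("INSERT OR IGNORE INTO newsfeeds (url, submitted_by, submitted_at) "
               "VALUES (?, ?, strftime('%s','now'))", (url, dev_key))
    db.commit()

def roll_back_key(dev_key: str, since_epoch: int = 0):
    """Remove everything a misbehaving key submitted after a given time."""
    db.execute("DELETE FROM newsfeeds WHERE submitted_by = ? AND submitted_at >= ?",
               (dev_key, since_epoch))
    db.commit()
```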

ByteHamster commented 4 years ago

Thank you very much for all the replies.

> We want things very clean and we're moving slowly to make sure that happens.

This is what I hoped to hear. Duplicates like unofficial mirrors or old feeds without a redirect (that still return a valid feed) can make the search function pretty much unusable for average users - at least in my experience with the gpodder.net search feature.