Podcastindex-org / podcast-namespace

A wholistic rss namespace for podcasting

Strange podping activity #441

Closed ryan-lp closed 1 year ago

ryan-lp commented 1 year ago

These observations were not rare, so if you watch the activity for a couple of minutes, you will likely be able to observe similar examples.

While watching the podping stream, I saw the newsradio1620 feed appear multiple times in succession. Stranger still, the URL takes me to a Castos login page, and so should probably not have been broadcast via podping:

http://newsradio1620.com/feed/podcast
http://newsradio1620.com/feed/podcast
http://newsradio1620.com/feed/podcast
https://blackcollegenines.com/feed/podcast/
http://newsradio1620.com/feed/podcast
https://khilafah.news/ur/?feed=podcast/
http://newsradio1620.com/feed/podcast
http://newsradio1620.com/feed/podcast

All pings from heliumradio.com point to invalid XML documents, which are actually two XML documents concatenated together:

https://heliumradio.com/feed/podcast/benja-welldone-comedy-show/

media.rss.com pings seem to be a bit spammy; for example, the following URL was pinged 56 times over a couple of minutes (possibly suggesting a bug in rss.com):

https://media.rss.com/outofthedryingpan/feed.xml

Each time a ping occurs, you might also get a sort of Slashdot effect, with all podcast apps trying to download the same feed at the same time, and this is amplified if the pinger sends many duplicate pings for the same feed.

Perhaps this suggests some usage guidelines that pingers should agree to when joining the network (don't spam redundant events, ping only valid feeds), and usage guidelines for ping watchers so that they buffer pings and schedule the fetch for a bit later, adding a bit of a diffusion effect. For example, a ping watcher might buffer up the pings for the same host and then send a batch request to that host when the buffer is full, or when the buffer has been stagnant.
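To make that watcher-side idea concrete, here is a minimal sketch. The per-host buffering is the point; fetch_feeds(), MAX_BATCH, and MAX_IDLE_SECS are hypothetical placeholders, not part of any existing podping tooling.

import time
from collections import defaultdict
from urllib.parse import urlparse

MAX_BATCH = 50        # flush a host once this many distinct feeds are queued (illustrative)
MAX_IDLE_SECS = 120   # ...or once no new ping for that host has arrived recently (illustrative)

buffers = defaultdict(dict)   # host -> {feed_url: time of most recent ping}

def fetch_feeds(host, feed_urls):
    # Placeholder for the watcher's real feed-processing pipeline.
    print(f"batch-fetching {len(feed_urls)} feeds from {host}")

def flush(host):
    feeds = buffers.pop(host, {})
    if feeds:
        fetch_feeds(host, sorted(feeds))

def on_ping(feed_url, now=None):
    now = now if now is not None else time.time()
    host = urlparse(feed_url).netloc
    buffers[host][feed_url] = now        # duplicate pings collapse into one entry
    if len(buffers[host]) >= MAX_BATCH:
        flush(host)

def flush_stagnant(now=None):
    # Call this periodically to flush hosts whose buffers have gone quiet.
    now = now if now is not None else time.time()
    for host, feeds in list(buffers.items()):
        if feeds and now - max(feeds.values()) > MAX_IDLE_SECS:
            flush(host)

A nice side effect is that duplicate pings for the same URL cost the watcher nothing extra, since they only refresh the timestamp of an entry that is already buffered.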

albertobeta commented 1 year ago

@ryan-lp thanks for pointing that out. We recently removed email addresses from all our RSS feeds. Prior to proceeding with that mass RSS feed update, we spoke with @daveajones and we decided to temporarily disable podping. It is possible, however, that a few Podping requests were sent anyway (e.g. by old k8s pods that were still active at the time of the change). It occurs to me that this could be an artifact produced by that mass update.

In any case, "spammy" and "spam" are not the most appropriate labels, given that we are all striving to innovate in the same industry and we are all in the same boat, rowing in the same direction. This is rather a matter of defining guidelines and best practices. In the absence of official guidelines (to the best of my knowledge), our implementation of Podping at RSS.com is extremely simple: each time we update an RSS feed, we call Podping. So in your example, if we did update a given RSS feed 56 times in a couple of minutes (it's not common, but there are a few reasons why this could have happened), then we called Podping 56 times.

If we want to implement rate limiters or more sophisticated rules, we can certainly do that. But we need to define criteria and best practices first. For instance, when we disabled Podping last week before a mass update of the over 24k RSS feeds that we host, we proactively reached out to Dave as a "courtesy" to align and to avoid sending too many requests. But, theoretically speaking, shouldn't we have pinged podping 24k times instead? After all, we really did update those RSS feeds.

Good points though. We probably need better guidelines for the "pingers" and the "ping watchers". The only drawback I can see in making the implementation more complex is that it would add a bit more friction to the adoption of podping for new joiners.

ryan-lp commented 1 year ago

If 24k RSS feeds are updated, then I think it makes perfect sense to send 24k pings, because the ping watchers will inevitably need to reflect all of those changes. The only question is whether you want to spread those pings out over time to avoid a Slashdot effect hitting back at your own server. So if there are 100 ping watchers, 24k pings will soon be followed by 2.4 million requests to your server, all of which are ultimately necessary, but you would have the power to spread those out if you needed to spread the load over time.
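As an aside, the pacing itself is trivial to sketch. In the snippet below, send_ping() is a hypothetical stand-in for whatever call a host uses to notify podping, and the rate is purely illustrative:

import time

def send_ping(feed_url):
    print(f"podping {feed_url}")   # placeholder for the real podping call

def ping_bulk(feed_urls, pings_per_second=10):
    # Pace a mass update so the resulting watcher fetches arrive as a
    # steady trickle rather than a spike: 24k feeds at 10/s take ~40 minutes.
    interval = 1.0 / pings_per_second
    for url in feed_urls:
        send_ping(url)
        time.sleep(interval)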

But as for the 56 pings for the same feed over 2 minutes, this is definitely worth optimising, because each ping will trigger the entire feed-processing pipeline for 100 ping watchers (using the example number I gave earlier), which can in some cases constitute significantly more load than the pinger itself bears. If ping watchers knew in advance that the exact same feed URL was going to be pinged 56 times over the next 2 minutes, we might prefer to just wait for the final ping and then run our processing pipeline once. But the watchers can't really know that in advance. So if the pinger does know that it is about to ping the same URL 56 times over the next 2 minutes, say as part of some sort of collective update operation, we might be able to imagine a guideline for how best to go about that. Perhaps something along these lines:

AVOID:

for each episode in feed:
    modify(episode)
    ping(feed)

GOOD:

for each episode in feed:
    modify(episode)

ping(feed)

(P.S. I meant spam in the Monty Python sense, or simply repeatedly broadcasting the same message multiple times in a row. You are right, we're rowing in the same direction, so my apologies for any negative connotations arising from that choice of word.)

brianoflondon commented 1 year ago

The system can cope with a mass pinging of a large block; in fact, it is more efficient if you can send a batch of pings at once. Dave's part of the system, podping.cloud, will batch these and as a result send out fewer podpings, each carrying many IRIs (up to around 130), which is more efficient. Additionally, podping.cloud will de-duplicate everything it receives (I'm not sure exactly what time frame Dave does the aggregation over).

The hivewriter software works in line with Hive's 3-second blocks, so although that will de-duplicate, in practice there doesn't seem to be much duplication within 3 seconds.

BTW here's an example of that in production: https://hive.ausbit.dev/tx/78ba9473715003597b4bf10ce52bd62225225e48 Buzzsprout batch up their changes and this single podping carried 178 IRIs, the current world champion podping!

3speak recently pinged all of their (smaller number of) feeds and nobody but me noticed, though this goes through a bespoke system which I operate (not podping.cloud).

Sometimes I think podcast creators really do make multiple edits to a feed, correcting typos and such over the course of a few minutes; 56 is on the high end, but the whole system seems to work.

albertobeta commented 1 year ago

It's ok to call it spam if it is in the Monty Python sense ;)

We checked the logs for the RSS feed we host that you mentioned in your message above. It was not updated 56 times but actually 90 times, and in a time span of 36 minutes (between 3:50 AM and 4:26 AM UTC), not 2 minutes. The person in question was fine-tuning their podcast description by adding and removing information, and each time they saved the description it triggered an update of the RSS feed, with a consequent podping. Each update of the feed sends a podping task to our queue system, which is typically executed within 1 second. When you say that podping received all these requests in a couple of minutes, how many minutes exactly? It should be 36, ballpark.

Perhaps we can introduce buffer logic to group changes and delay calling podping, but it would be great to decide together which criteria this logic should follow so that everyone implements it the same way. If other hosting companies implemented podping like we do, then a podcaster editing their podcast description 90 times would trigger 90 feed updates and 90 podping requests. It would be great to compare notes with others on how they approach this.

brianoflondon commented 1 year ago

I'm getting these from the back end database of Pingslurp which records all the podpings.

Started: 2023-02-19T02:52:54.000+00:00 and ended at 2023-02-19T03:28:54.000+00:00

And I think it was around 74 pings. :-)

I think those are UTC, but I'm not going to quibble over an hour; it seems to match up with yours.

Hive can handle this, but there might well be some utility in delaying pings a little, especially for feed updates where timing isn't super critical.

jamescridland commented 1 year ago

A suggestion for an algorithm (as a starting point) that might work for everyone...

On a change of an RSS feed... Has there been another change to this feed in the last five minutes? NO: then podping it! YES: then...

  1. If it is an addition or deletion of an audio file, podping it!
  2. If it is a metadata change, schedule a podping for 5 minutes time. No matter how many changes happen in the next five minutes, ensure that only one podping is sent.

This should mean that new episodes are published immediately, other changes propagate within five minutes, and podpings are not sent more often than once every five minutes (except for additions or deletions of audio files).
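For concreteness, a rough host-side sketch of those rules; the change classification, the timer mechanism, and the five-minute window are illustrative only and not part of any spec:

import threading
import time

DEBOUNCE_SECS = 5 * 60

last_change = {}   # feed_url -> time of the previous change seen
pending = set()    # feed_urls with a delayed podping already scheduled

def send_ping(feed_url):
    print(f"podping {feed_url}")   # placeholder for the real podping call

def flush(feed_url):
    pending.discard(feed_url)
    send_ping(feed_url)

def on_feed_change(feed_url, audio_added_or_deleted, now=None):
    now = now if now is not None else time.time()
    changed_recently = now - last_change.get(feed_url, 0) < DEBOUNCE_SECS
    last_change[feed_url] = now
    if audio_added_or_deleted or not changed_recently:
        send_ping(feed_url)        # new or removed episodes always go out immediately
    elif feed_url not in pending:
        pending.add(feed_url)      # coalesce further metadata edits into one delayed ping
        threading.Timer(DEBOUNCE_SECS, flush, [feed_url]).start()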

OR

Podping consumers (directories) might just run a simple queue that does the same sort of thing - watches for frequently updated feeds and puts them in a queue for single updates. (Harder this time to know if it's a new episode though).

The podping system is certainly capable of rapid frequent updates; but many downstream services may not be. I think it's easily fixed with a sensible algorithm. Podnews doesn't use podping (for a variety of reasons) and scaling concerns are certainly a factor here.

Either way, it's a good problem to have!

ryan-lp commented 1 year ago

It was not updated 56 times, but actually 90 times in a time span of 36 minutes

I see, it looks like I had recorded 2 log files and that observation came from the longer one.

If this is just the user hitting the save button many times, then it's a tricky one:

--> https://media.rss.com/outofthedryingpan/feed.xml
    https://feeds.buzzsprout.com/1998103.rss
    https://feeds.transistor.fm/masters-of-the-cinematic-universe
--> https://media.rss.com/outofthedryingpan/feed.xml
    https://media.rss.com/genteconproposito/feed.xml
--> https://media.rss.com/outofthedryingpan/feed.xml
    https://feeds.buzzsprout.com/1700716.rss
--> https://media.rss.com/outofthedryingpan/feed.xml

(+ 86)

Obviously we would want to avoid having every listener re-fetch and reprocess this URL every time the save button is pressed, and @jamescridland 's algorithm looks like it would work.

But assuming that not every host will implement this algorithm, maybe something could be built into the podping protocol itself?

Otherwise I suspect we'll end up with everyone (both senders and receivers) implementing their own measures to handle this.

brianoflondon commented 1 year ago

I think we can close this issue @daveajones but I am keeping an eye on this.

ryan-lp commented 1 year ago

@jamescridland 's algorithm is essentially "debounce", with the additional idea of "high priority" feed changes that bypass the debounce mechanism and go through immediately.

I think this could be handled by the podping protocol if we have a "priority" field in the message.

As hinted above, requiring every host to implement this might not be realistic, so there will still be some hosts that will ping the same URL frequently, and this will inevitably motivate the consumers downstream to defensively implement a similar sort of algorithm to debounce the stream of pings. But without knowing the priority, they won't be able to handle that special case where a high priority change is pinged (e.g. adding a new audio file).

If messages had priorities, this algorithm could be handled either defensively by the consumers or automatically within podping itself.
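As a thought experiment, the consumer-side version might look something like the sketch below, assuming a hypothetical priority field in the message (it is not part of the current podping payload); the five-minute window and the fetch() placeholder are likewise illustrative:

import threading
import time

DEBOUNCE_SECS = 5 * 60

last_fetch = {}   # feed_url -> time the feed was last fetched
pending = set()   # feed_urls with a deferred fetch already scheduled

def fetch(feed_url):
    last_fetch[feed_url] = time.time()
    print(f"fetching {feed_url}")   # placeholder for the real feed-processing pipeline

def deferred_fetch(feed_url):
    pending.discard(feed_url)
    fetch(feed_url)

def on_podping(feed_url, priority="normal", now=None):
    now = now if now is not None else time.time()
    fetched_recently = now - last_fetch.get(feed_url, 0) < DEBOUNCE_SECS
    if priority == "high" or not fetched_recently:
        fetch(feed_url)             # e.g. a new audio file was added
    elif feed_url not in pending:
        pending.add(feed_url)       # coalesce repeated low-priority pings
        threading.Timer(DEBOUNCE_SECS, deferred_fetch, [feed_url]).start()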