Podcastindex-org / podping.cloud

The server code that runs a podping.cloud node.
MIT License

Consider adding url-level receive_time to the blockchain payload #36

Open johnspurlock opened 3 years ago

johnspurlock commented 3 years ago

When a ping comes in from the host, it would be great to capture the request time and carry it along with the associated url when later writing to the block - this info is currently lost due to the queuing done before writing to the blockchain.

It's the truest timestamp we have on which to base all downstream delay measurements.

This would require changing the payload to support url-level attributes, which could be useful for other attributes in the future.

e.g. the proposed v0.3 payload:

"version": "0.3",
...
"urls": {
    "https://rss.whooshkaa.com/rss/podcast/id/8209": { "receive_time": "2021-05-22T22:28:36.473Z" },
    "https://feeds.buzzsprout.com/262529.rss": { "receive_time": "2021-05-22T22:29:36.473Z" }
}

versus the current v0.2 payload:

"version": "0.2",
...
"urls": [
    "https://rss.whooshkaa.com/rss/podcast/id/8209",
    "https://feeds.buzzsprout.com/262529.rss"
]
stevencrader commented 3 years ago

I agree that it might be good to transition to a format that allows including other attributes about the feed. I would suggest using a list of objects instead. We currently have some cases where the same feed URL will show up in the block multiple times.

"version": "0.3",
...
"urls": [
    {
        "url": "https://rss.whooshkaa.com/rss/podcast/id/8209",
        "receive_time": "2021-05-22T22:28:36.473Z" 
    },
    {
         "url": "https://feeds.buzzsprout.com/262529.rss",
         "receive_time": "2021-05-22T22:29:36.473Z"
    }
]

In this format, maybe we change the top-level urls to something else.

johnspurlock commented 3 years ago

Works for me - under what circumstances would the same feed url show up in the same block batch? Two quick pings from the same source (source in this case means buzzsprout, etc), or two pings from different sources?

If the former, we should probably just coalesce them within a single block. If the latter, we should probably add a source attribute at the item level as well.

Perhaps name the collection pings instead of urls.
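The coalescing idea above can be sketched in Python. This is a hypothetical helper, not code from the repo: it assumes per-ping dicts with `url`, `receive_time`, and an optional `source` key, merges duplicates for the same URL, keeps the earliest receive time, and collects the sources that pinged it.

```python
from typing import Iterable

def coalesce_pings(pings: Iterable[dict]) -> list[dict]:
    """Merge pings for the same URL into one entry, keeping the
    earliest receive_time and the set of sources that pinged it."""
    merged: dict[str, dict] = {}
    for ping in pings:
        url = ping["url"]
        entry = merged.setdefault(
            url,
            {"url": url, "receive_time": ping["receive_time"], "sources": set()},
        )
        # ISO-8601 timestamps in UTC compare correctly as strings
        entry["receive_time"] = min(entry["receive_time"], ping["receive_time"])
        if "source" in ping:
            entry["sources"].add(ping["source"])
    return list(merged.values())
```

Whether duplicates from *different* sources should be merged or kept separate is exactly the open question in this thread; the sketch merges them and records both sources.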

stevencrader commented 3 years ago

The pings are coming from the hosts. From what it looks like, a host (e.g. Buzzsprout) sends a ping anytime a change is saved. If two saves are done close together, they can end up in the same block when published to Hive.

In a thread the other day on podcastindex.social, we were discussing these duplicate pings and what to do with them. Right now, I don't think any filtering is being applied and all pings are written to the blocks. The app or service that is watching the index could decide to ignore a ping if received within a close time frame. Here is that thread.

daveajones commented 3 years ago

This all sounds fine to me. If we make the change now, future metadata additions will be easy.

@brianoflondon ?

brianoflondon commented 3 years ago

I'm very much in favor of allowing for more metadata in a ping, but I don't think receive time is the one to go for.

We're talking here about under 30 seconds of resolution for an item which, for the vast majority of podcasts, updates on a cadence of a week or perhaps a day. We could solve the double-post thing by delaying things for a minute or two and de-duplicating, but it really seems better to just let almost everything through. I certainly could de-duplicate the URL list in any given custom_json.

Right now the writer accepts URLs for up to 3s or until it reaches 130 URLs, which I've tested and which keeps us below the max size of a custom_json (8192 bytes). There's no point measuring exactly; 130 keeps us comfortably below and is huge!

HIVE_OPERATION_PERIOD = 3  # 1 Hive operation per this period in seconds
MAX_URL_PER_CUSTOM_JSON = 130  # total json size must be below 8192 bytes
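The batching rule described above ("up to 3s or until 130 URLs") can be sketched as a queue-draining loop. This is a simplified illustration, not the actual hive-writer code; the queue name and function are assumptions.

```python
import time
from queue import Empty, Queue

HIVE_OPERATION_PERIOD = 3      # seconds; one Hive custom_json per period
MAX_URL_PER_CUSTOM_JSON = 130  # keeps the serialized JSON well under 8192 bytes

def collect_batch(incoming: Queue) -> list[str]:
    """Drain URLs from the queue for up to one operation period,
    stopping early once the batch reaches the URL cap."""
    deadline = time.monotonic() + HIVE_OPERATION_PERIOD
    batch: list[str] = []
    while len(batch) < MAX_URL_PER_CUSTOM_JSON:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # period elapsed; send whatever we have
        try:
            batch.append(incoming.get(timeout=remaining))
        except Empty:
            break  # nothing more arrived before the deadline
    return batch
```

The cap of 130 acts as a coarse but safe proxy for the 8192-byte custom_json limit, matching the "no point measuring exactly" reasoning above.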

3s is exactly the Hive block time and I haven't noticed us sending multiple pings in the same block (which isn't really a problem anyway and will happen later when multiple entities are writing directly to Hive).

I've just added a deduplicating step within that 3s block, we could push that to 5s or 10s which would catch more but Dave is already buffering things on his side.

If we had a receive time, that would mean Dave sending it all the way up the chain through his queue, which means he sends two pieces of info per ping instead of one, and hive-writer has to maintain both.

However, in order to preserve the simplest list format when we're just pinging, I would prefer something like the following: two separate lists, one with just the URLs and a parallel meta_data list matching the first in length, carrying whatever arbitrary data we want. This stops us repeating empty metadata dictionary keys all over the place, or even having them most of the time. It's trivial to recombine these two lists in Python.

{
    "required_auths": [],
    "required_posting_auths": [
        "blocktvnews"
    ],
    "id": "podping",
    "json": {
        "version": "0.3",
        "num_urls": 3,
        "reason": "feed_update",
        "urls": [
            "https://feeds.simplecast.com/_lOI3coD",
            "https://audioboom.com/channels/5027138.rss",
            "https://feeds.transistor.fm/orthodox-lectionary"
        ],
        "meta_data" : [
            {
                "source": "simplecast",
                "whatever": "else"
            },
            {
                "source": "audioboom",
                "we_can": "think_of"
            },
            {
                "source": "transistor"
            }
        ]
    }
}
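Recombining the two parallel lists is indeed trivial; a minimal sketch (the function name is made up for illustration) might look like:

```python
def recombine(payload: dict) -> list[dict]:
    """Zip the parallel urls and meta_data lists back into a single
    list of per-URL objects. meta_data is optional in this sketch."""
    urls = payload["urls"]
    meta = payload.get("meta_data", [{}] * len(urls))
    if len(meta) != len(urls):
        raise ValueError("meta_data must match urls in length")
    return [{"url": u, **m} for u, m in zip(urls, meta)]
```

The length check matters: the whole scheme relies on positional correspondence, so a writer that drops one entry without the other silently corrupts every item after it.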
brianoflondon commented 3 years ago

Philosophical point:

We are writing to an immutable database. I'm trying to be mindful all the time of what we are doing and the impact on the whole chain. Podping isn't going to break Hive or even trouble it significantly and we will bring much more value to Hive in terms of proving its utility than we will take out.

However, I'm trying to be as careful as I can with extra data. I'm even looking at the version number now and wondering whether it really gives us much when it is repeated 28,000 times per day, especially since it may change in a week or two! I don't think anyone is making use of that version number right now, so I'm thinking of removing both version and reason until we are actually using them.

For sure I foresee reason being used for a going live update, but we're not there yet!

daveajones commented 3 years ago

I really wouldn’t remove version and reason. It feels better to just shorten them and document the schema here in the repo. For instance just “v” for version and make it an integer. That’s incredibly tiny. And for “reason”, just use “r” and an integer value with the mapping of numbers to reason codes here in the repo. I already have a use for reason that I just discussed with someone yesterday. We’re only talking about 10 bytes of extra JSON data per block if we pare it down like that.
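The shortened schema suggested here could look something like the sketch below. The specific reason codes are hypothetical (the thread says only that the number-to-reason mapping would be documented in the repo); the point is that "v" and "r" as integers cost roughly ten bytes per block.

```python
# Hypothetical reason-code table -- the real mapping would be
# documented in the repo, per the suggestion above.
REASON_CODES = {0: "feed_update", 1: "live"}

def expand(compact: dict) -> dict:
    """Expand a pared-down payload ({"v": 1, "r": 0, ...}) back
    into the verbose form for downstream consumers."""
    return {
        "version": compact["v"],
        "reason": REASON_CODES[compact["r"]],
        "urls": compact["urls"],
    }
```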

daveajones commented 3 years ago

Is the full current JSON schema even documented here yet? Maybe I missed it.

brianoflondon commented 3 years ago

I'll work on this. And documenting the custom_json properly.

seakintruth commented 3 years ago

Just have to remember that as additional metadata is added, the current MAX_URL_PER_CUSTOM_JSON may need to be reduced to accommodate it.
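One way to address this, rather than hand-tuning the URL cap as metadata grows, is to check the serialized size directly against the 8192-byte limit. A hypothetical sketch (these helpers are not in the repo):

```python
import json

MAX_CUSTOM_JSON_BYTES = 8192  # Hive's custom_json size limit

def fits_in_custom_json(payload: dict) -> bool:
    """Check the compactly serialized payload against the limit."""
    raw = json.dumps(payload, separators=(",", ":")).encode()
    return len(raw) <= MAX_CUSTOM_JSON_BYTES

def trim_to_fit(urls: list[str], meta: list[dict]) -> tuple[list[str], list[dict]]:
    """Drop trailing url/meta pairs until the payload fits,
    keeping the two parallel lists the same length."""
    while urls and not fits_in_custom_json({"urls": urls, "meta_data": meta}):
        urls, meta = urls[:-1], meta[:-1]
    return urls, meta
```

Trimmed entries would need to be carried over into the next custom_json rather than dropped, but that queuing detail is omitted here.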