Podcastindex-org / podcast-namespace

A wholistic rss namespace for podcasting
Creative Commons Zero v1.0 Universal
382 stars · 115 forks

Suggestion: Prioritize episode metadata file over RSS tags #184

Open theDanielJLewis opened 3 years ago

theDanielJLewis commented 3 years ago

We had this conversation on Mastodon, but I realized I never brought it here to GitHub. So let me summarize.

Chapters were moved from the XML to a per-episode "episode metadata" JSON file. They have their own chapters object. The file is then referenced from the Podcast Namespace chapters tag.

This identifies the feature in the RSS feed, but points to the file for the actual data. This allows episode-level edits without changing the RSS feed, and it allows players to get the episode-level data when needed, not only when the entire feed is refreshed. And it keeps the feed size smaller (there could be 100 chapters, but the feed gets only one additional XML tag per episode).
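The pattern can be sketched in a few lines. The namespace URI below matches the published namespace, but the feed snippet, URLs, and file layout are hypothetical illustrations, not taken from the spec:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed snippet: the item carries only a pointer to the
# external chapters file, not the chapter data itself.
FEED = """<rss xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <item>
      <title>Episode 1</title>
      <podcast:chapters
          url="https://example.com/ep1/chapters.json"
          type="application/json+chapters"/>
    </item>
  </channel>
</rss>"""

NS = {"podcast": "https://podcastindex.org/namespace/1.0"}
item = ET.fromstring(FEED).find("channel/item")
chapters = item.find("podcast:chapters", NS)
print(chapters.get("url"))  # the only chapter-related bytes in the feed
```

The chapters (or any other bulky per-episode data) then live in the JSON file behind that URL, and the feed itself carries only the pointer.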

I suggest we follow this same pattern for other episode-level metadata that requires more than a single tag, pointing to the same episode metadata JSON file, which would have the details inside of an appropriate object.

For example, #180 could have a single XML tag pointing to the metadata file. Similarly <podcast:person> might have several entries that should be in the metadata file, but the XML feed links to it with only <podcast:people url="…" />.

So for every new tag proposal, I think we need to ask ourselves, "Does this need to be an XML tag, or can it be put in the episode metadata?"

eteubert commented 3 years ago

I have my troubles with the current chapters spec but missed my chance to comment there (#47) in time. But since this issue points in the same direction, I'm taking my chance to put my thoughts here.

My main statement is: extracting simple episode metadata to an external file is a mistake.

The beauty of RSS is its simplicity. It’s easy to generate for the server because it’s just one file. And it’s easy to consume for clients because one request is all it takes to get all podcast and episode metadata. Now with your chapter spec the server needs to generate n+1 files (one feed and n episode JSON files) and the client needs to make n+1 requests to update a feed.

I see the most trouble on the client side. Writing code to fetch/parse/update a feed becomes suddenly much more complex and takes an order of magnitude more time, because HTTP is usually the bottleneck in these scenarios. On top of that, each request creates bandwidth overhead in the form of HTTP headers; and wasn't saving bandwidth one of the main arguments for extracting the chapters (or other metadata)? I'm not convinced the current spec achieves that, at least with the usual chapter count of maybe a dozen per episode.

This allows episode-level edits without changing the RSS feed, and it allows players to get the episode-level data when needed, not only when the entire feed is refreshed.

But how do I, as a client or podcast directory, know when episode metadata has changed? In the case of the chapters spec there is a version number, but it's in the file itself, so I have to read the file to know if it changed. So I still need to read the feed and all referenced episode-level external files to check for updates. Only now, with the suggested specs, it takes more time and traffic (HTTP overhead) and is harder to implement (n+1 requests instead of one).

There is a side battle to be fought over introducing JSON as the format of those external metadata files, but I feel like arguments about this are mostly subjective and hard to base on facts. There was already an attempt to create a JSON Feed spec, and as far as I'm aware, aside from the initial hype, it did not take off.

To summarise, my plea is to keep it simple. The more scattered the data and formats are, the less likely clients are to adopt the specs.

brianoflondon commented 3 years ago

But how do I as a client or podcast directory know when episode metadata has changed?

I was having exactly this thought process. I scrape a feed, find an episode, and mirror it on Hive. At the time I do this, there is no chapters file. A few hours later, Dreb Scott does his magic with Adam and the chapters file is updated; how do I know to go back and look? One of my ideas is to mirror the chapters JSON in my post metadata on Hive (very easy to do), but I have to know it's changed to go grab it.

PofMagicfingers commented 3 years ago

I do understand your concern and shared it at first. However, in other issues we discussed how we could save bandwidth for users and hosting providers by moving some data to external JSON files.

By moving all metadata to an external per-episode file, we could keep the RSS to the minimal information and allow clients to download metadata only when needed. Exactly like the web works: you do not embed images and sound inside the HTML file, you reference them.

If your concern is knowing when metadata has changed, that's mine too. We could imagine using a hash somewhere, in any format: <podcast:metadata url="[url]" hash="[some sha1, or even a crc32 or a date; free format]">

Why not use something like HTML's integrity attribute if we want it to be more secure: https://developer.mozilla.org/fr/docs/Web/Security/Subresource_Integrity

<podcast:metadata url="https://exemple.com/episode1.json" integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC">

It could be annoying to implement on the hosting side, but it does fix knowing when the file changed and whether we got the correct file.
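For what it's worth, computing an SRI-style value is cheap on both sides. A minimal sketch in Python; the `sri_digest` helper name and the sample payload are made up for illustration:

```python
import base64
import hashlib

def sri_digest(data: bytes, algo: str = "sha384") -> str:
    """Build a Subresource-Integrity-style value: '<algo>-<base64 digest>'."""
    digest = hashlib.new(algo, data).digest()
    return f"{algo}-{base64.b64encode(digest).decode('ascii')}"

# The host publishes this value in the feed attribute; a client recomputes
# it over the downloaded bytes and refetches (or discards) on mismatch.
metadata = b'{"version": "1.2.0", "chapters": []}'
print(sri_digest(metadata))
```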

brianoflondon commented 3 years ago

If your concern is knowing when metadata has changed, that's mine too, we could imagine using a hash somewhere in any format :

This. We need this. And if this hash can be a call to PodcastIndex instead of downloading the entire RSS feed then we have a winner.

PofMagicfingers commented 3 years ago

if this hash can be a call to PodcastIndex instead of downloading the entire RSS feed then we have a winner.

I can't agree with that. PodcastIndex is a beautiful project, but IMHO the podcast namespace should be generic and should not make the Podcast Index API mandatory.

IIRC we already have something to link a podcast to its Podcast Index data with podcast:id. Maybe we could expand usage of this tag to items and allow an RSS feed to link to episode pages on platforms. But that is another chat for another issue ^^

brianoflondon commented 3 years ago

if this hash can be a call to PodcastIndex instead of downloading the entire RSS feed then we have a winner.

I can't agree with that, PodcastIndex is a beautiful project but IMHO the podcast namespace should be generic and should not make podcast index api mandatory.

I should have added ALSO rather than INSTEAD. Yes to having the hash in the feed, as well as having PodcastIndex duplicate it alongside the episode data.

adamc199 commented 3 years ago

For your consideration:

Including all data in the RSS feed creates vendor lock in to the hosting provider.

The namespace tag data can be created by a wide variety of authoring tools.

External data files opens the door to a lot of content innovation.

PofMagicfingers commented 3 years ago

I'm not sure what you mean by vendor lock-in. Why would it be more of a vendor lock-in when including data in RSS, or less of one when using an external JSON file?

Using an external file or using tags in RSS would probably not change how you enter this information into your hosting provider, or its UX.

However, it's certainly true that using a JSON metadata file would allow us to expand the spec really easily and be more future-proof.

eteubert commented 3 years ago

However, it's certainly true that using a JSON metadata file would allow us to expand the spec really easily and be more future-proof.

Can you elaborate on that? The "X" in XML literally stands for "Extensible". It's the reason we can simply introduce a new namespace like this one and add new data to the feed without affecting existing implementations. I don't see how a JSON metadata file would enhance on that.

PofMagicfingers commented 3 years ago

Yeah you're right about that.

I was thinking of JSON as more flexible and extensible because you do not need to define a custom namespace to add an attribute. It is also more modern and smaller for the same amount of data.

IMHO it makes sense to use it for episode metadata, because we could easily add new types of data with something like JSON-LD (as discussed in #180).

When using RSS tags only, we would need to create new tags in the namespace (or a new namespace with new tags) for each new kind of data. That could feel "bloated" really fast.

Anyway, you're still right: we could also do this with XML tags in RSS. But that's not the point of this issue, which, I think, is to externalize this data primarily to save bandwidth and to prevent the RSS from getting too bulky.

adamc199 commented 3 years ago

if you re-read my comment, you will find that I am advocating FOR an external file.

Vendor lock-in occurs when it is NOT an external file.

Reason: I can create chapter files in hypercatcher or podfriend, but I would be unable to do that if restricted to a hosting company's UX only.

PofMagicfingers commented 3 years ago

if you re-read my comment, you will find that I am advocating FOR an external file.

Yeah I know, re-read my comment ^^

Vendor lock-in occurs when it is NOT an external file.

I was wondering why it would occur when it's not an external file.

Reason: I can create chapter files in hypercatcher or podfriend but would be unable to do that if restricted to a hosting company UX only.

Yeah, but I don't see any link between the hosting company's UX and the final technical implementation.

Your hosting company could provide you a UX for your chapters and metadata, or allow you to import a file from podfriend etc., regardless of the final implementation (external file or RSS tags).

External file or RSS tags, all of this aims to be generic and vendor-independent, so I see no vendor lock-in in any of these solutions.

eteubert commented 3 years ago

Agreed, let's not have the JSON vs. XML discussion here :) Back to the topic:

By moving all metadata to an external per episode file, we could keep the RSS with the minimal information, and allow clients to download metadata only when needed. Exactly like the www works : you do not embed images and sound inside the html file, you reference them.

It took a while for me to digest that statement. In fact, we already do that in the current RSS spec with images. However, the web is read by web browsers, which are much more complex beasts than an XML parsing library :) And as far as I'm aware, a very common support request for podcast hosts is: "I changed my podcast cover, why isn't it updated in iTunes/Spotify/... yet?" Because caching is hard, that's why ;) So if we indeed want to externalize more data, this needs to be handled better.

Having some kind of cache-key/hash/etag on all links to external files might indeed help. But again, we need to anticipate implementation details, like: when an external file is updated, the hash in the feed must update as well. Server caches for both files must be flushed proactively, otherwise the feed might claim there's a new file while a stale cached version is still being served. Again, caching is hard.

Speaking of caching, I'm wondering how consistently podcast hosts set the ETag/Last-Modified HTTP headers of their RSS feeds. Because if they don't, and clients usually have to download the whole feed instead of just doing a HEAD request, it may be worth promoting this more, as it would really save a lot of bandwidth. Maybe someone who has to parse a lot of feeds can chime in with some first-hand experience here.
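The revalidation flow being discussed can be sketched like this (the header names are standard HTTP; the helper itself is hypothetical). A client stores the ETag/Last-Modified values from its last fetch and sends them back; a well-behaved server then answers 304 Not Modified with no body when nothing changed:

```python
def conditional_headers(etag=None, last_modified=None):
    """Request headers for feed revalidation. If the server supports them,
    an unchanged feed costs a 304 response instead of the full body."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

# e.g. urllib.request.Request(feed_url, headers=conditional_headers(etag='"abc123"'))
```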

PofMagicfingers commented 3 years ago

Agreed, let's not have the JSON vs. XML discussion here :) Back to the topic

Well, in fact that's the topic of this issue! 😅

@theDanielJLewis opened it to discuss exactly this :

Chapters were moved from the XML to a per-episode "episode metadata" JSON file. [...] I suggest we follow this same pattern for other episode-level metadata that requires more than a single tag, pointing to the same episode metadata JSON file.

I suggested mimicking HTML's new integrity attribute on a podcast:metadata tag to fix caching and integrity issues.

We could indeed use that same idea for other resources; that's really something we should look into (I get support emails about cover image problems monthly...). However, caching and hashes in general are a topic for another issue IMHO.

swschilke commented 3 years ago

Having everything in one place/file is easier to handle. Spreading everything across multiple files can (and will) be more error-prone. The split between metadata (RSS) and content is a reasonable one (so you don't have to load everything at once), but I think it does not apply to additional information. E.g., some people argue that chapters belong in the ID3 tags of the MP3 file.

vv01f commented 3 years ago

On MP3: first, the media might not be MP3 at all, or even audio/video. Second, community collaboration on chapters makes it impossible to add that information to the MP3.

PofMagicfingers commented 3 years ago

IMHO, it's better to split episode metadata into another file per episode.

Mainly, it reduces feed size and bandwidth usage. It allows people on a mobile connection to quickly update their podcasts without downloading a huge amount of data (chapters, metadata, topics, etc.) for all episodes, even ones they won't download or listen to.

The client app can download this metadata file alongside the media content. You click play or download, and your app fetches the metadata. You save space and bandwidth by downloading only what you need, when you need it.

Furthermore, we're building a new standard here, so not every app will support it right away. IMHO, it's better to avoid apps downloading a huge RSS file full of information they're not compatible with. It's not a big issue when it's a couple of new RSS tags; if it's a chapter list, or a full list of topics with metadata, it could become a size and bandwidth nightmare.

theDanielJLewis commented 3 years ago

I'm finally catching up to all of this. So I'll share my thoughts to different points.

Writing code to fetch/parse/update a feed is suddenly much more complex and takes an order of magnitude more time because http is usually the bottleneck in these scenarios.

The idea of episode metadata files is to offload them from the RSS feed. Then a podcast app would download the episode metadata (or check for an update) only when the episode is played. This would probably be no more than a few KB, which is much better than refreshing a multi-MB RSS feed.

But how do I as a client or podcast directory know when episode metadata has changed?

The metadata file should have header information making it obvious when it was last modified. Then any app would need to check only the headers before redownloading the data file.

eteubert commented 3 years ago

Then, a podcast app would download the episode meta (or check for an update) only when the episode is played.

Podcatchers are not the only clients. Directories (and directory-like backends for services) need to keep episode data updated as well and they'll still need to "check everything" periodically. My worry is that implementation complexity will become an issue for them.

IMHO, it's better to avoid apps download a huge RSS file with information they're not compatible with. It's not a big issue when it's a couple new RSS tags. If it's a chapter list, or a full list of topics with metadata, it could become a size and bandwith nightmare.

To reduce bandwidth for podcatchers, the most effective way is to push usage of feed pagination (see https://github.com/Podcastindex-org/podcast-namespace/issues/117#issuecomment-754540578). If we can cut the average feed request down from 50++ episodes to the latest 10, that reduces bandwidth usage for podcatchers dramatically. I wouldn't worry about a few extra KB added by including chapter information (or other small amounts of metadata) in the feed.

PofMagicfingers commented 3 years ago

What about my idea (a few messages above) of mimicking HTML's new integrity attribute?

It does fix the issue of knowing when metadata has changed. If we include topics, chapters, persons, and more data in episodes, that could get pretty large. My preference, as a podcast directory manager and developer, goes to an external file that I will most of the time download only once.

About pagination: from a directory point of view it lightens the feed, but we have no simple way to know if an episode on a distant page has been modified or deleted. (But that's probably a chat to have on #117.)

eteubert commented 3 years ago

What about my idea of mimicking HTML's new integrity attribute?

Yes, if external files are referenced there has to be a functionality like that. A hash of the referenced content seems fine to me.

About pagination, [...] we have no way to know in a simple way if an episode on distant pages has been modified or deleted.

Fair point. But even with the current RSS standard, it's tricky to figure out what's new or deleted. Has the latter been discussed here yet? Maybe an item tag containing only the guid tag and a new podcast:deleted/removed/unpublished tag inside it? If this has not come up yet, I'll create a new issue.

theDanielJLewis commented 3 years ago

I now understand the benefit of a hash or timestamp for each episode's metadata in the RSS feed. This would allow even directories to simply compare a small string instead of making even an HTTP header request. I'm on board with this now.

PofMagicfingers commented 3 years ago

@theDanielJLewis exactly: we compare only this attribute and the URL of the metadata JSON file, and we know it has changed if either of these two values has changed.

This hash could be a free-form string allowing a timestamp, any content, or even an integrity checksum, like so: <podcast:metadata url="" hash="sha1:a3b4c2...">

IMHO it should be optional but recommended. You could either use it, or simply change the URL of the JSON file when it changes, like we do with the enclosure for Spotify.
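The resulting client-side check is tiny. A sketch, assuming the client stores the last-seen attribute pair per episode (the function name and dict shapes are hypothetical):

```python
def metadata_changed(stored: dict, current: dict) -> bool:
    """Refetch the external file if either the url attribute or the
    (optional, free-form) hash attribute differs from what we stored."""
    return any(stored.get(k) != current.get(k) for k in ("url", "hash"))
```

A feed that omits the hash and rotates the URL instead still triggers a refetch, since the URL comparison catches it.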

theDanielJLewis commented 3 years ago

We could have the hash on only the metadata tag, which would be required if any other episode metadata is used, like chapters, so that an app would know it doesn't have to redownload the data for chapters if it already knows it has the latest.

(The reason I still think chapters should be a separate RSS tag, even though the data is in the separate file, is that the RSS tag shows the app/directory what features there are, while the external file provides the data to power those features. If we relied on only the metadata file to communicate the existence of features, apps/directories might have to download the metadata before playback.)

PofMagicfingers commented 3 years ago

You're right! Another idea could be using a rel attribute with a comma-separated list of what the file will contain:

<podcast:metadata rel="chapters, topics, people" url="" hash="sha1:a3b4c2...">

Or multiple tags with one rel for each?

theDanielJLewis commented 3 years ago

I'm not sure how willing @daveajones would be to alter the current chapters spec. But I do like the idea of merely indicating the features with a single string instead of a tag. chapters, is 10 characters, but the chapters RSS tag is much more. That seems trivial for one instance, but multiply it by 50, 100, or thousands of episodes.

Plus, this could save lots of space for other features, too.

I'm not sure rel would be the best attribute. Maybe it could be features.

<podcast:metadata features='chapters,people,value' hash='439gues4jigs94jrg943tg' href='…' />
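Parsing that proposed shape is straightforward. A sketch (the tag, its attributes, and the example values come from the proposal above, not from the published spec):

```python
import xml.etree.ElementTree as ET

# Hypothetical tag: none of this is in the published namespace yet,
# it is just the shape under discussion.
TAG = """<podcast:metadata
    xmlns:podcast="https://podcastindex.org/namespace/1.0"
    features="chapters,people,value"
    hash="439gues4jigs94jrg943tg"
    href="https://example.com/ep1/meta.json"/>"""

el = ET.fromstring(TAG)
# Tolerate optional whitespace after commas ("chapters, people, value").
features = [f.strip() for f in el.get("features").split(",")]
print(features)  # ['chapters', 'people', 'value']
```
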
PofMagicfingers commented 3 years ago

I really like this idea of a versatile file containing all metadata, which can be lazy-loaded by clients only when needed, and re-fetched by clients if the url, hash, or features change.

This idea prevents clutter a lot. Everything is neatly stored in a separate JSON file per episode.

tomrossi7 commented 3 years ago

This hash could be a free-form string allowing a timestamp, any content or even a integrity checksum...

I just wanted to throw out that proper use of ETags and HTTP 304s should eliminate the need for a hash. We don't want to duplicate what has already been built into HTTP, and ETags exist for exactly this reason. If you wanted to store a hash value, you could use the ETag value. Does that make sense?

PofMagicfingers commented 3 years ago

It does make sense, and I'm all for not reinventing the wheel. But as a hosting provider, I'd rather get no request at all from clients when none is needed than a conditional request with If-None-Match.

We could also use what many use for static resources: changing the URL when the resource has changed, by including a hash in the URL. That's how Spotify and many others detect a change of enclosure.
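That cache-busting approach can be sketched as embedding a short content hash in the URL; the naming scheme below is made up for illustration:

```python
import hashlib

def versioned_url(base: str, data: bytes) -> str:
    """Return a URL that changes whenever the file content changes, so
    plain URL comparison doubles as change detection."""
    return f"{base}?v={hashlib.sha1(data).hexdigest()[:8]}"

print(versioned_url("https://example.com/ep1/meta.json", b'{"chapters": []}'))
```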

agates commented 2 years ago

I think we need to support non-HTTP sources for external files, just like alternateEnclosure. The world won't always be web services.