We're currently using the syndication library's feed discovery, which uses regex to scrape the page for potential feed sources. Unfortunately, since it doesn't actually parse the HTML, it's quite unreliable even when the page includes feed metadata. If the discovered feed doesn't exist, or fails to load or parse, we should use our HTML parser to implement a better discovery algorithm:
1. link rel=alternate tags. The library does scan for these, but it misses tags that aren't formatted exactly as it expects, and it sometimes picks up non-XML links (e.g. alternate-language versions of the page).
2. Links where (xml|rss|rdf|atom) appears as the anchor text, alt text, file extension, trailing path component, or a GET parameter value.
3. Try the default WordPress feed location at /feed.
4. Maybe recurse if step 2 produces non-error HTML (e.g. an rss link that points to a landing page instead of directly to a feed).
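
The candidate-gathering part of the steps above could be sketched roughly as follows. This is only an illustration: it uses Python's standard-library html.parser as a stand-in for our own HTML parser, and the names FeedLinkCollector, discover_candidates, FEED_TYPES and HINT are hypothetical. The set of accepted MIME types is also an assumption. Fetching, validating each candidate as a feed, and the optional recursion are left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urljoin, urlparse

# Assumed set of feed MIME types; adjust to whatever we decide to accept.
FEED_TYPES = {
    "application/rss+xml", "application/atom+xml", "application/rdf+xml",
    "application/xml", "text/xml",
}
HINT = re.compile(r"\b(xml|rss|rdf|atom)\b", re.IGNORECASE)


def url_has_hint(url):
    """True if the extension, trailing path component, or a query value is a hint."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    last = segments[-1] if segments else ""
    ext = last.rpartition(".")[2]  # equals `last` when there is no dot
    if HINT.fullmatch(last) or HINT.fullmatch(ext):
        return True
    return any(HINT.fullmatch(value) for _, value in parse_qsl(parsed.query))


class FeedLinkCollector(HTMLParser):
    """Collects step 1 (link rel=alternate) and step 2 (hinted anchors) candidates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.alternates = []
        self.hinted = []
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # anchor text seen so far
        self._alt_hit = False  # an <img alt=...> inside the anchor matched

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "link":
            rels = a.get("rel", "").lower().split()
            mime = a.get("type", "").split(";")[0].strip().lower()
            # Requiring a feed MIME type filters out hreflang alternates.
            if "alternate" in rels and mime in FEED_TYPES and a.get("href"):
                self.alternates.append(urljoin(self.base_url, a["href"]))
        elif tag == "a" and a.get("href"):
            self._href, self._text, self._alt_hit = a["href"], [], False
        elif tag == "img" and self._href is not None:
            if HINT.search(a.get("alt", "")):
                self._alt_hit = True

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href)
            if self._alt_hit or HINT.search("".join(self._text)) or url_has_hint(url):
                self.hinted.append(url)
            self._href = None


def discover_candidates(html, base_url):
    """Return de-duplicated candidate feed URLs in priority order (steps 1 to 3)."""
    collector = FeedLinkCollector(base_url)
    collector.feed(html)
    collector.close()
    fallback = urljoin(base_url, "/feed")  # default WordPress location
    return list(dict.fromkeys(collector.alternates + collector.hinted + [fallback]))
```

A caller would try each returned URL in order and stop at the first one that loads and parses as a feed. If a step 2 candidate returns HTML instead, it could be passed back through discover_candidates with a depth limit to cover the recursion case.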