danmactough / node-feedparser

Robust RSS, Atom, and RDF feed parsing in Node.js

How to handle html from the `Author` field? #290

Closed chuanqisun closed 3 years ago

chuanqisun commented 3 years ago


Problem feed meta: [image]

In the feed item, the author field contains HTML: [image]

The parser strips the entire `<a>` tag from the `author` property in the output: [image]

The `rss:author` property has some additional information, but I think it's difficult to write generalized extraction logic, as the structure can differ from feed to feed: [image]

I wonder if there is an easy way to just get the plaintext within the Author field: `by Preston So`.

Thanks!

danmactough commented 3 years ago

That feed is not valid: https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Falistapart.com%2Fmain%2Ffeed%2F

This is a sad but common problem when parsing feeds. Feedparser doesn't have an opinion about how you should handle invalid feeds -- everyone kind of needs to figure that out for themselves given the goals of the project they're working on.

> I wonder if there is an easy way to just get the plaintext within the Author field: `by Preston So`

For this specific workaround, the `#` property contains the plain text parts of the original feed item. So, you would need to recursively parse the `rss:author` property to pull out the `#` properties, then join them together with a space.
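
A minimal sketch of that recursive approach, in case it helps. The helper names `collectText` and `plainText` are hypothetical (they are not part of feedparser), and the `sampleAuthor` object is only an assumed shape for `item['rss:author']` based on the description in this issue: `@` holds attributes, `#` holds text, and child elements are keyed by tag name. The pieces are joined in object-key order, which may not always match the original document order.

```javascript
// Hypothetical helper (not part of feedparser): gather every '#' text value
// found anywhere under a parsed element.
function collectText(node) {
  if (node == null) return [];
  // Repeated child elements are assumed to show up as arrays.
  if (Array.isArray(node)) return node.flatMap(collectText);
  if (typeof node !== 'object') return [];

  const parts = [];
  for (const [key, value] of Object.entries(node)) {
    if (key === '@') continue; // skip attributes such as href
    if (key === '#') {
      if (typeof value === 'string') parts.push(value);
    } else {
      parts.push(...collectText(value)); // descend into child elements
    }
  }
  return parts;
}

// Trim each piece, drop empty ones, and join with a single space.
function plainText(node) {
  return collectText(node)
    .map((s) => s.trim())
    .filter(Boolean)
    .join(' ');
}

// Assumed shape of item['rss:author'] for the feed in this issue.
const sampleAuthor = {
  '@': {},
  '#': 'by',
  a: { '@': { href: 'https://example.com/author' }, '#': 'Preston So' },
};

console.log(plainText(sampleAuthor)); // prints "by Preston So"
```

In a feedparser `readable` handler this would be called as `plainText(item['rss:author'])`. Since the structure can differ from feed to feed, it is probably worth falling back to `item.author` when the result is empty.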