fterh / rsg-retrivr

This Reddit bot is all about the "too lazy; didn't click" life
https://reddit.com/u/rsg-retrivr

Added feature to extract authors' names from articles #5

Closed changhuapeng closed 6 years ago

changhuapeng commented 6 years ago

I have implemented this feature to work like the other existing selectors. However, for sites like Bloomberg that use different HTML structures for different sections and articles, I find it preferable and more consistent to get the authors' names from the metadata.

~~For example, to extract the name from this meta tag: `<meta name="author" content="John">`, you can use the following text for the `authors_selector` in `sites.json`: `"meta[name='author']"`~~

~~Again, for sites like Bloomberg that use different meta tags for different sections and articles, such as `<meta name="author" content="John">` or `<meta name="parsely-author" content="John">`, you can use the following text to state multiple tag names: `"meta[name='author'|'parsely-author']"`~~

Authors' names are currently left blank for articles that do not list any.
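To illustrate the metadata idea described above, here is a minimal sketch in Python using only the standard library. It is not the code from this pull request: the class `MetaAuthorParser` and the function `extract_authors` are hypothetical names, and the list of meta tag names is taken from the two examples mentioned above. It returns an empty string when no author tag is found, matching the "left blank" behaviour.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Collect the content of <meta> tags whose name is in a given set.

    Hypothetical helper for illustration; not part of rsg-retrivr.
    """

    def __init__(self, meta_names):
        super().__init__()
        self.meta_names = set(meta_names)
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # An attribute without a value is reported as None, so guard against it.
        content = (attributes.get("content") or "").strip()
        if name in self.meta_names and content and content not in self.authors:
            self.authors.append(content)


def extract_authors(html, meta_names=("author", "parsely-author")):
    """Return a comma-separated string of authors, or "" if none are listed."""
    parser = MetaAuthorParser(meta_names)
    parser.feed(html)
    return ", ".join(parser.authors)
```

For example, `extract_authors('<meta name="parsely-author" content="John">')` yields the name from the second tag variant, while a page with neither tag yields an empty string.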

Edit: authors' names are now added to articles using the response from the Mercury web parser.

fterh commented 6 years ago

Hey, I've reworked the parsing mechanism, and I confess I don't quite understand your code. Instead of a pull request, would you like to separately maintain an author module that returns the author? I'll be glad to integrate that.

changhuapeng commented 6 years ago

No problem, let me go and understand your new parsing implementation before deciding how we can go about the author module.

changhuapeng commented 6 years ago

For now, I have updated the code to use the author name from the Mercury web parser's response. From what I have tested, the Mercury web parser cannot reliably scrape authors' names, so we may still need rules for each individual site if we want this feature.
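A rough sketch of how that could look, assuming the Mercury response has already been decoded into a Python dict with an `author` key that may be missing or null. The function name `pick_author` and the `site_fallback` callable are hypothetical and only stand in for whatever per-site rules might be added later:

```python
def pick_author(mercury_response, site_fallback=None):
    """Prefer the parser's author field; otherwise try a per-site rule.

    mercury_response: dict decoded from the parser's JSON response (assumed shape).
    site_fallback: optional zero-argument callable returning a name or None
                   (hypothetical hook for site-specific rules).
    """
    author = (mercury_response.get("author") or "").strip()
    if author:
        return author
    if site_fallback is not None:
        return (site_fallback() or "").strip()
    # No author available: leave the field blank.
    return ""
```

Keeping the fallback as a separate callable would let a standalone author module be maintained independently of the main parsing code.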