changhuapeng closed 6 years ago
Hey, I've reworked the parsing mechanism, and I confess I don't quite understand your code. Instead of a pull request, would you like to separately maintain an author module that returns the author? I'll be glad to integrate that.
No problem, let me go understand your new parsing implementation before deciding how we can go about the author module.
For now, I have updated the code to use the author name from the Mercury web parser response. From what I have tested, Mercury web parser cannot reliably scrape authors' names, and we may still need rules for each individual site should we want this feature.
I have implemented this feature to be as similar as possible to the other existing selectors. But for sites like Bloomberg that have different HTML structures for their different sections/articles, I find it preferable and more consistent to use the metadata to get the authors' names for these sites.
~~For example, to extract the name from this meta tag:
<meta name="author" content="John">
You can use the following text for the authors_selector in sites.json:
"meta[name='author']"
~~
~~Again, for sites like Bloomberg that use different meta tags for different sections/articles, like:
<meta name="author" content="John">
or
<meta name="parsely-author" content="John">
You can use the following text to state multiple tag names:
"meta[name='author'|'parsely-author']"
~~
Authors' names are currently left blank for articles that do not list any.
Edit: authors' names are added to articles using the response from Mercury web parser.
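To make the meta-tag idea above concrete, here is a minimal sketch in Python using only the standard library. It is not the code from this pull request. The function names (`parse_meta_names`, `extract_authors`), the `fallback_author` parameter, and the reading of `|` in the selector as "any of these names" are all assumptions made for illustration. The fallback stands in for whatever author value a parser such as Mercury returned.

```python
import re
from html.parser import HTMLParser


def parse_meta_names(selector):
    """Return the meta tag names listed in a selector such as
    "meta[name='author'|'parsely-author']" (assumed syntax)."""
    match = re.fullmatch(r"meta\[name=(.+)\]", selector.strip())
    if match is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    # Split the alternatives on "|" and drop the surrounding quotes.
    return [part.strip().strip("'\"") for part in match.group(1).split("|")]


class _MetaCollector(HTMLParser):
    """Collect the content of <meta> tags whose name is in `names`."""

    def __init__(self, names):
        super().__init__()
        self.names = set(names)
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = (attributes.get("content") or "").strip()
        if attributes.get("name") in self.names and content:
            if content not in self.found:  # keep order, skip duplicates
                self.found.append(content)


def extract_authors(html, selector, fallback_author=None):
    """Return authors found in matching meta tags, else the fallback
    (hypothetically, the author field from another parser), else []."""
    collector = _MetaCollector(parse_meta_names(selector))
    collector.feed(html)
    if collector.found:
        return collector.found
    return [fallback_author] if fallback_author else []
```

For example, `extract_authors(page_html, "meta[name='author'|'parsely-author']", fallback_author=mercury_author)` would prefer the page's own metadata and use the parser-supplied name only when no matching tag is present. Here `page_html` and `mercury_author` are placeholder variables.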