changhuapeng opened 6 years ago
One method is to replace the soup.select("p")
, at line 43 of mercury.py
, with soup.find_all(lambda tag: tag.name == "p" and not tag.attrs)
with the assumption that all legitimate paragraphs are encapsulated with <p>
without any classes, id, etc attributes. This means that the mistaken <p class="row">
will then be then omitted by the bot because of the class attribute.
I have tested out this solution with multiple past articles posted by the bot and found it to be generally reliable except for this yahoo article and scmp.com article where certain paragraphs are omitted by the bot because of their <p class="xxxx">
tag.
One way to fix this is to replace soup.select("p"), at line 43 of mercury.py, with soup.find_all(lambda tag: tag.name == "p" and not tag.attrs), on the assumption that all legitimate paragraphs are wrapped in plain <p> tags without any class, id, or other attributes. The mistaken <p class="row"> would then be omitted by the bot because of its class attribute. I have tested this solution against multiple past articles posted by the bot and found it to be generally reliable, except for this Yahoo article and scmp.com article, where certain paragraphs are omitted by the bot because of their <p class="xxxx"> tags.
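For comparison, a sketch of the proposed change on similarly hypothetical HTML; it also shows the limitation mentioned above, where a legitimate paragraph that happens to carry a class (as in the Yahoo and scmp.com articles) is dropped as well:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one stray <p class="row"> produced by Mercury, two plain
# paragraphs, and one legitimate paragraph that carries a class attribute.
html = """
<p class="row">Stray wrapper turned into a paragraph.</p>
<p>First real paragraph.</p>
<p>Second real paragraph.</p>
<p class="xxxx">Legitimate paragraph that also has a class.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Proposed replacement for soup.select("p") at line 43 of mercury.py:
# keep only <p> tags that have no attributes at all.
paragraphs = soup.find_all(lambda tag: tag.name == "p" and not tag.attrs)

for p in paragraphs:
    print(p.get_text())
# Prints only the two plain paragraphs: the <p class="row"> is skipped (the fix),
# but so is the <p class="xxxx"> paragraph (the known limitation).
```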