edwardchalstrey1 closed this 5 years ago
Discussed in person.
extractors/extract_title.py
@jemrobinson If you think we need a more complicated ranking/scoring system then let me know, but I think this way makes sense.
Discussed in person, particularly the issue of constructing appropriate selectors in BeautifulSoup. Propose to do the following: use lxml.etree to parse the page and use .xpath() to pull out the appropriate elements.
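A minimal sketch of that proposal, for illustration only. The function name extract_title, the TITLE_XPATHS list and the specific XPath expressions are assumptions, not the rules taken from the crawler site configs; the "first non-empty match wins" ordering is also just one simple option.

```python
from lxml import etree

# Hypothetical, ordered XPath rules: the first one that yields text wins.
# The real rules would come from the site configs, not from this list.
TITLE_XPATHS = [
    "//meta[@property='og:title']/@content",
    "//h1[contains(@class, 'entry-title')]//text()",
    "//title/text()",
]


def extract_title(html):
    """Return the first non-empty title found by the XPath rules, or None."""
    if not html or not html.strip():
        return None
    root = etree.HTML(html)  # lenient HTML parser, returns the root element
    if root is None:
        return None
    for expression in TITLE_XPATHS:
        for match in root.xpath(expression):
            # Collapse runs of whitespace and skip empty text nodes
            text = " ".join(str(match).split())
            if text:
                return text
    return None


if __name__ == "__main__":
    page = (
        "<html><head><title>Site | Page</title></head>"
        "<body><h1 class='entry-title'> Real  Headline </h1></body></html>"
    )
    print(extract_title(page))
```

Keeping the rules as an ordered list of plain XPath strings means a site-specific rule can be added without touching the parsing code, which avoids building an equivalent selector by hand in BeautifulSoup.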
Titles can now be pulled from HTML pages using rules based on the misinformation crawler site configs.
A relatively small variety of title tags covers all sites seen so far.
Tests have been added and existing tests updated to include extracted titles.