Title extraction

alan-turing-institute / ReadabiliPy

A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.

MIT License

Closed. edwardchalstrey1 closed this 5 years ago

edwardchalstrey1 commented 5 years ago

Titles can now be extracted from HTML pages using rules based on the misinformation crawler site configs.

A relatively small variety of title tags covers all sites seen so far.

Tests have been added and existing tests updated to include extracted titles.

jemrobinson commented 5 years ago

Discussed in person.

edwardchalstrey1 commented 5 years ago

@jemrobinson

Extraction moved to a submodule.
Scoring system: every title tag specified by the "extraction paths" is now used to search the HTML for a title (see extract_title.py). The extraction paths are ranked by the order in which they appear in the list. If several titles are extracted from the HTML, we use the most common one. If all titles are equally common, we choose the first title that was extracted, which corresponds to the highest-ranked extraction path.
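
A minimal sketch of the selection rule described above, not the code in extract_title.py. The function name `choose_title` is hypothetical, and it assumes the candidate titles arrive as a list already ordered by extraction-path rank. It also assumes that a partial tie (several titles sharing the top count) is broken the same way as a full tie, by taking the earliest extracted one.

```python
from collections import Counter


def choose_title(extracted_titles):
    """Pick one title from candidates listed in extraction-path rank order.

    Hypothetical helper: the most common title wins, and ties go to the
    candidate that was extracted first (the highest-ranked path).
    """
    if not extracted_titles:
        return None
    counts = Counter(extracted_titles)
    best_count = max(counts.values())
    # Walk the candidates in rank order and return the first one
    # that reaches the highest count.
    for title in extracted_titles:
        if counts[title] == best_count:
            return title


if __name__ == "__main__":
    # "B" is extracted twice, so it beats the higher-ranked "A".
    print(choose_title(["A", "B", "B"]))
    # All titles are equally common, so the first extracted one is used.
    print(choose_title(["A", "B"]))
```

Because ties fall back to list order, putting the most trusted extraction path first in the list is what decides the result when no title repeats.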

If you think we need a more complicated ranking/scoring system then let me know, but I think this way makes sense.

jemrobinson commented 5 years ago

Discussed in person, particularly the issue of constructing appropriate selectors in BeautifulSoup. Propose to do the following:

construct XPaths for different title options (plus appropriate trustability score)
use lxml.etree to parse the page and use .xpath() to pull out the appropriate elements
sort unique extracted text strings by sum of scores
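
The proposal above could look roughly like the following sketch. The XPath expressions, the score values, and the name `rank_titles` are illustrative assumptions, not the selectors or scores used in ReadabiliPy. The only lxml calls used are `etree.HTML()` to parse the page and `.xpath()` to pull out matching text.

```python
from collections import defaultdict

from lxml import etree

# Hypothetical XPaths with hypothetical trust scores, for illustration only.
TITLE_XPATHS = [
    ("//meta[@property='og:title']/@content", 4),
    ("//h1/text()", 2),
    ("//title/text()", 1),
]


def rank_titles(html, scored_xpaths=TITLE_XPATHS):
    """Return unique candidate titles sorted by summed score, best first."""
    root = etree.HTML(html)
    totals = defaultdict(int)
    for xpath, score in scored_xpaths:
        # Count each distinct string once per XPath, even if it matches
        # several elements.
        matches = {str(result).strip() for result in root.xpath(xpath)}
        for text in matches:
            if text:
                totals[text] += score
    return sorted(totals, key=totals.get, reverse=True)


if __name__ == "__main__":
    page = (
        "<html><head><title>My Article | Example Site</title>"
        '<meta property="og:title" content="My Article"/></head>'
        "<body><h1>My Article</h1></body></html>"
    )
    print(rank_titles(page))
```

Summing scores means a string found by several lower-trust XPaths can outrank one found by a single higher-trust XPath, which is the behaviour the last step of the proposal implies.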