alan-turing-institute / ReadabiliPy

A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.
MIT License
230 stars, 36 forks

Title extraction #57

Closed: edwardchalstrey1 closed this 5 years ago

edwardchalstrey1 commented 5 years ago

Titles can now be extracted from HTML pages using rules based on the misinformation crawler site configs.

A relatively small variety of title tags covers all sites seen so far.

Tests have been added and existing tests updated to include extracted titles.

jemrobinson commented 5 years ago

Discussed in person.

edwardchalstrey1 commented 5 years ago

@jemrobinson

  1. Extraction moved to submodule
  2. Scoring system: every title tag listed in the "extraction paths" is now used to search the HTML for a title (see extract_title.py). The extraction paths are ranked by their order in the list. If more than one title is extracted, we use the most common one. If all extracted titles are equally common, we use the first one extracted, which corresponds to the highest-ranked extraction path.

If you think we need a more complicated ranking/scoring system then let me know, but I think this way makes sense.
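
For reference, the selection rule described in point 2 could be sketched roughly as below. This is only an illustration and not the code in extract_title.py: the function name `choose_title` and its input (a list of candidate titles already ordered by extraction-path rank, with `None` or empty strings for paths that matched nothing) are hypothetical. It also assumes that a partial tie, where several titles share the top count but others are less common, is broken by rank in the same way as a full tie.

```python
from collections import Counter


def choose_title(candidates):
    """Pick one title from candidates listed in extraction-path rank order.

    Hypothetical helper: `candidates` holds one entry per extraction path,
    highest-ranked path first; None or blank entries mean "nothing found".
    """
    # Ignore paths that did not yield a title
    found = [t.strip() for t in candidates if t and t.strip()]
    if not found:
        return None

    counts = Counter(found)
    top_count = max(counts.values())

    # Walk in rank order so that ties go to the highest-ranked path
    for title in found:
        if counts[title] == top_count:
            return title


# Most common title wins even if a higher-ranked path disagrees
print(choose_title(["Site | Story", "Story", "Story"]))
# All titles equally common: fall back to the highest-ranked path
print(choose_title(["First", "Second"]))
```

Walking the candidates in rank order, rather than relying on how a counter orders equal counts, keeps the tie-break explicit.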

jemrobinson commented 5 years ago

Discussed in person, particularly the issue of constructing appropriate selectors in BeautifulSoup. Propose to do the following: