AndrewStanton94 / HTMLInquisition

Tools to check that web pages have the expected content.
GNU General Public License v3.0

Extract quotes from news pages #2

Open AndrewStanton94 opened 5 years ago

AndrewStanton94 commented 5 years ago

Sample page: https://uopnews.port.ac.uk/2015/05/20/killer-fungus-finding-surprises-scientists/

Using [...document.querySelector('.hentry-content.clear').children].filter((elem) => elem.tagName === 'P').filter((elem) => elem.innerText.includes()) to get a list of article paragraphs.
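The second filter above calls includes() with no argument, so which character it tests for is not stated. A minimal sketch of one way to finish it, assuming the intent is to keep paragraphs that contain a quote mark. The QUOTE_CHARS list and the helper names hasQuoteMark and quoteParagraphs are hypothetical, not from the project:

```javascript
// Assumption: these are the quote marks used on the news pages.
const QUOTE_CHARS = ['"', '\u201C', '\u201D'];

// True if the text contains at least one of the assumed quote marks.
function hasQuoteMark(text) {
  return QUOTE_CHARS.some((ch) => text.includes(ch));
}

// Keep only the <p> children of the container that contain a quote mark.
// `container` is any object with a `children` collection, e.g. the element
// returned by document.querySelector('.hentry-content.clear').
function quoteParagraphs(container) {
  return [...container.children]
    .filter((elem) => elem.tagName === 'P')
    .filter((elem) => hasQuoteMark(elem.innerText));
}
```

In the browser this would be called as quoteParagraphs(document.querySelector('.hentry-content.clear')).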

In multi-paragraph quotes, only the final paragraph has a closing quote mark; the earlier paragraphs omit it.

Need to extract the people being quoted, e.g. from "$x said:". What about postfix attribution, where the speaker is named after the quote?
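A rough sketch of how both attribution styles might be matched, given the text immediately before and after a quote. The verb list, the name pattern, and the function name findSpeaker are all assumptions for illustration; real articles will need a broader pattern:

```javascript
// Assumption: attribution verbs; real articles may use others.
const VERBS = '(?:said|added|explained)';
// Assumption: a name is an optional title plus one or more capitalised words.
const NAME = "((?:(?:Dr|Professor|Prof|Mr|Mrs|Ms)\\.? )?[A-Z][\\w'-]+(?: [A-Z][\\w'-]+)*)";

// Prefix attribution: "<name> said:" at the end of the text before the quote.
const PREFIX = new RegExp(`${NAME},? ${VERBS}:?\\s*$`);
// Postfix attribution: "said <name>" or "<name> said" right after the quote.
const POSTFIX = new RegExp(`^[\\s,]*(?:${VERBS} ${NAME}|${NAME} ${VERBS})`);

// Returns the speaker's name, or null if neither pattern matches.
function findSpeaker(before, after) {
  const pre = before.match(PREFIX);
  if (pre) return pre[1];
  const post = after.match(POSTFIX);
  if (post) return post[1] || post[2];
  return null;
}
```

Prefix attribution is tried first, so a paragraph with both forms reports the speaker named before the quote.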

AndrewStanton94 commented 5 years ago

Implement in browser

Fetch the page at the URL entered in an input box.

Query the DOM using the code above.

Iterate over the paragraphs to build two lists: standard paragraphs and quote paragraphs.

Check for the presence of an opening quote. Add adjacent quote paragraphs to an array until a paragraph with a closing quote is found.
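The grouping step could look something like this. It is only a sketch: it assumes the pages use curly quote marks (straight quotes would need different handling) and that every paragraph of a multi-paragraph quote starts with an opening mark. The names leavesQuoteOpen and groupQuoteParagraphs are made up for the example:

```javascript
// Assumption: the pages use curly quotes, not straight ones.
const OPEN = '\u201C';  // left double quotation mark
const CLOSE = '\u201D'; // right double quotation mark

// A paragraph leaves a quote open when its last quote mark is an opener.
function leavesQuoteOpen(text) {
  return text.lastIndexOf(OPEN) > text.lastIndexOf(CLOSE);
}

// Takes paragraph texts in page order and returns an array of groups,
// where each group is the list of paragraphs making up one quote.
function groupQuoteParagraphs(paragraphs) {
  const groups = [];
  let current = [];
  for (const text of paragraphs) {
    const inQuote = current.length > 0;
    if (!inQuote && !text.includes(OPEN)) continue; // standard paragraph
    current.push(text);
    if (!leavesQuoteOpen(text)) {
      groups.push(current); // closing quote found, group is complete
      current = [];
    }
  }
  if (current.length > 0) groups.push(current); // quote never closed
  return groups;
}
```

A single-paragraph quote comes back as a group of one, so later steps can treat both cases the same way.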

Once the lists are populated, loop through the quotes. Extract the text before and after each quote. Strip out known fluff.
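A possible shape for the extraction step, again assuming curly quote marks. What counts as "known fluff" is not listed here, so this sketch only trims surrounding whitespace and punctuation as a placeholder; stripFluff and splitAroundQuote are hypothetical names:

```javascript
// Assumption: the pages use curly quotes.
const OPEN = '\u201C';
const CLOSE = '\u201D';

// Placeholder for "known fluff": leading/trailing whitespace and punctuation.
function stripFluff(text) {
  return text.replace(/^[\s,.:;]+|[\s,.:;]+$/g, '');
}

// Splits a paragraph into the text before the quote, the quote itself,
// and the text after it. Returns null if there is no opening quote.
function splitAroundQuote(text) {
  const start = text.indexOf(OPEN);
  if (start === -1) return null;
  const end = text.lastIndexOf(CLOSE);
  // No closing mark after the opener: treat the quote as running to the end.
  const stop = end > start ? end : text.length;
  return {
    before: stripFluff(text.slice(0, start)),
    quote: text.slice(start + 1, stop),
    after: stripFluff(text.slice(stop + 1)),
  };
}
```

The before and after strings are where a speaker's name would be looked for.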

Display the data on screen. Make it content editable. Each item should have a delete button.

Have an output button to generate and populate a table.
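One way the table could be generated is as an HTML string that the button's click handler assigns to an element's innerHTML. The column names and the { speaker, quote } row shape are assumptions, as are the function names escapeHtml and quotesToTable:

```javascript
// Escape text so user-edited content cannot inject markup into the table.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Assumption: each row is an object of the form { speaker, quote }.
function quotesToTable(rows) {
  const body = rows
    .map(({ speaker, quote }) =>
      `<tr><td>${escapeHtml(speaker)}</td><td>${escapeHtml(quote)}</td></tr>`)
    .join('');
  return '<table><thead><tr><th>Speaker</th><th>Quote</th></tr></thead>'
    + `<tbody>${body}</tbody></table>`;
}
```

Building a string keeps the function easy to test outside the browser; creating the rows with document.createElement would work just as well.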