AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
422 stars 34 forks

Parse Issue: some sites use <br> for newlines but the parser doesn't add spaces between sentences #240

Open AndyTheFactory opened 10 months ago

AndyTheFactory commented 10 months ago

Issue by HodorTheCoder Tue Sep 4 21:25:06 2018 Originally opened as https://github.com/codelucas/newspaper/issues/621


For example, from this article: https://abc7ny.com/15-year-old-girl-dies-after-5-story-fall-from-fire-escape/4134009/

A snippet of the HTML from the above article, which uses <br><br> to separate paragraphs:

A teenage girl died after she fell from the fire escape of an apartment building in Lower Manhattan late Sunday.<br><br>Police say 15-year-old Imogen Roche was attending a party inside an apartment building on Reade Street in Tribeca.<br> <div class="adRectangle-pos-small-inline" data-set="adAppend"></div> <br>It appears she left her cell phone in a room that was locked just before 11 p.m.<br><br>Authorities say Imogen went onto the fire escape, attempting to reenter the apartment by going in another window, when she lost her balance and fell.<br><br>

This translates to the following text when you parse it:

A teenage girl died after she fell from the fire escape of an apartment building in Lower Manhattan late Sunday.Police say 15-year-old Imogen Roche was attending a party inside an apartment building on Reade Street in Tribeca.It appears she left her cell phone in a room that was locked just before 11 p.m.Authorities say Imogen went onto the fire escape, attempting to reenter the apartment by going in another window, when she lost her balance and fell.

You can see that the sentences collide without spaces in the parsed output wherever the <br><br> appear: Sunday.Police, Tribeca.It, p.m.Authorities, etc.

This is due to poor markup on the news site, for sure, but it actually happens a lot. Any chance we could check for <br> tags after a period or other sentence-ending punctuation and add a space in the parsed output to separate the sentences? It's hard for my NLP post-processor to treat, for example, "Sunday.Police" as anything but one word because there is no space. Thanks. I would think this is an easy fix, right?

edit: title/HTML formatting
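To make the request concrete, here is a minimal sketch of the desired behavior using only the Python standard library. It is not newspaper's actual extraction code (which is lxml-based); the names `BrAwareTextExtractor` and `html_to_text` are hypothetical and exist only for this illustration. Each `<br>` becomes a newline, and runs of breaks and surrounding whitespace collapse into a single newline, so sentences can no longer run together.

```python
import re
from html.parser import HTMLParser


class BrAwareTextExtractor(HTMLParser):
    """Hypothetical helper: collect text and emit a newline for every <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # <br/> and <br /> also end up here via the default handle_startendtag.
        if tag == "br":
            self._chunks.append("\n")

    def handle_data(self, data):
        self._chunks.append(data)

    def get_text(self):
        text = "".join(self._chunks)
        # Collapse any whitespace run containing a newline into one newline.
        text = re.sub(r"[ \t]*\n\s*", "\n", text)
        return text.strip()


def html_to_text(html):
    """Hypothetical wrapper: return the text of an HTML fragment, one line per <br> run."""
    parser = BrAwareTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_text()
```

Applied to the snippet above, this would keep "Sunday." and "Police" on separate lines instead of producing "Sunday.Police". Collapsing to a single newline is a design choice for this sketch; it separates sentences but does not distinguish a single `<br>` from `<br><br>`.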

AndyTheFactory commented 10 months ago

Comment by andho Thu Nov 1 15:52:21 2018


Just for clarity: <br> is the correct way to insert a line break in HTML5. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/br

AndyTheFactory commented 10 months ago

Comment by kut Fri Mar 29 16:54:30 2019


Running into the same thing here - it also means we lose the paragraph information...

AndyTheFactory commented 10 months ago

Comment by shkarupa-alex Mon Mar 2 13:51:41 2020


+1 to having a newline for each <br> in the source HTML
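One possible workaround along these lines is to rewrite the raw HTML before it reaches the parser, replacing every `<br>` with exactly one newline so that `<br><br>` still marks a paragraph boundary. This is a rough sketch under assumptions: `br_to_newline` is a hypothetical helper, not part of newspaper, and whether the inserted newlines survive the library's own text cleaning has not been verified here.

```python
import re

# Matches <br>, <br/>, <br /> and <br clear="all">, case-insensitively,
# but not other tags that merely start with "br".
_BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def br_to_newline(html):
    """Hypothetical helper: replace each <br> tag with a single newline."""
    return _BR_TAG.sub("\n", html)
```

The rewritten HTML could then be handed to the article parser in place of the downloaded page, keeping one newline per `<br>` in the source.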