codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Duplicate content on certain site #719

Open JohnChu101 opened 5 years ago

JohnChu101 commented 5 years ago

I tried with this site: https://www.seoul.co.kr/news/newsView.php?id=20190815001004&wlog_sub=svt_006

and article.article_html returns duplicate content. If you search that page, you will find only one match for "EMP" (even in the HTML source), but in article_html it appears twice. Here is my code:

from newspaper import Article, Config as NewspaperConfig

url = "https://www.seoul.co.kr/news/newsView.php?id=20190815001004&wlog_sub=svt_006"
conf = NewspaperConfig()
# keep_article_html=True keeps the extracted article markup in article.article_html
article = Article(url, config=conf, keep_article_html=True, language="ko")
article.download()
article.parse()
print(article.article_html)
print(article.text)

unkwn1-repo commented 5 years ago

I didn't get duplicate content, but I tend to process articles differently.

Below is a copy and paste from an IPython shell:

In [1]: from newspaper import Article

In [2]: url = 'https://www.seoul.co.kr/news/newsView.php?id=20190815001004&wlog_sub=svt_006'

In [3]: article = Article(url)

In [4]: article.download()

In [5]: article.parse()

In [6]: print(article.text)
▲ 정경두 국방부 장관이 5일 국회에서 열린 국방위원회 전체회의에서 한 의원의 질문에 답하고 있다.

김명국 선임기자 daunso@seoul.co.kr

국방부가 북한의 탄도미사일 위협 대응과 장병 복지 강화를 위해 5년간 290조원가량의 국방비를 투입하기로 했다. 올해 46조 6000억원이던 국방 예산은 내년부터 연간 50조원을 돌파할 전망이다. 2022년으로 예상되는 전시작전통제권 전환을 포함해 안보 불안감을 해소하려는 취지로 보인다.국방부는 14일 향후 5년 동안 군사력 건설과 운영 계획을 담은 ‘2020~2024 국방중기계획’을 발표했다. 지난 1월 발표한 2019~2023년 중기계획(270조 7000억원)보다는 19조 8000억원이 증액됐다. 우선 방위력 개선에 103조 8000억원을 투입한다. 2019~2023 중기계획(94조 1000억원)보다 9조 7000억원이 늘었다. 패트리엇과 철매2 등 한국형 미사일방어(KAMD) 체계의 방어 지역을 확대하고 미사일 요격 능력을 더욱 높이고, 군 정찰위성을 전력화하는 등 북한의 신형 탄도미사일 탐지·요격 능력을 강화한다.내년부터 F35B 수직 이착륙 스텔스 전투기를 탑재할 수 있는 다목적 대형수송함(3만t 경항모급) 개념설계에 착수하고, 유사시 북한 전력망을 무력화할 수 있는 정전탄과 전자기펄스(EMP)탄을 개발한다.이주원 기자 starjuwon@seoul.co.kr

JohnChu101 commented 5 years ago

@jessefogarty It's in the article_html. First you need to set keep_article_html=True, then print article_html. You will find two matches for "EMP" in the output, whereas if you search the original webpage there is only one match, even in its HTML source.

unkwn1-repo commented 5 years ago

Could you take a screenshot of what you mean?

I'm looking at the source via the Chrome inspection tool and am only seeing one match for "EMP", which is in the Korean text.

Also, are you using the most up-to-date version of newspaper3k? (Just double-checking so we can rule things out.)

Cheers,

JohnChu101 commented 5 years ago

@jessefogarty Yes, you are right. What I mean is that the original webpage has only one match for "EMP". But if you run my code above and print(article.article_html), you will find two matches in the output, which means the content is repeated.
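
The comparison described above can be checked without reading the output by eye. The following is a minimal sketch, not part of newspaper itself: count_matches is a hypothetical helper, and the two sample strings are stand-ins for the real page and the extracted markup. It decodes HTML entities first, so it also works for Korean search terms, which article_html emits as numeric entities.

```python
import html
import re


def count_matches(markup: str, needle: str) -> int:
    """Count case-insensitive occurrences of needle in markup.

    Entities such as &#51204; are decoded first so that non-ASCII
    search terms can match entity-encoded output.
    """
    decoded = html.unescape(markup)
    return len(re.findall(re.escape(needle), decoded, flags=re.IGNORECASE))


# Hypothetical stand-ins for the downloaded page and the extracted markup.
page_html = "<div><p>&#51204;&#51088;&#44592;&#54148;&#49828;(EMP)&#53444;</p></div>"
extracted_html = "<div>(EMP)<br><br><p>(EMP)</p></div>"

print(count_matches(page_html, "EMP"))
print(count_matches(extracted_html, "EMP"))
```

With a real article, the same helper could be called as count_matches(article.html, "EMP") and count_matches(article.article_html, "EMP") after article.parse(). A higher count in the second call would point to duplicated content.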

JohnChu101 commented 5 years ago

I'm using newspaper3k==0.2.8. My output:

>>> print(article.article_html)
<div><a href="//img.seoul.co.kr//img/upload/2019/08/05/SSI_20190805171107.jpg" title="&#51221;&#44221;&#46160; &#44397;&#48169;&#48512; &#51109;&#44288;&#51060; 5&#51068; &#44397;&#54924;&#50640;&#49436; &#50676;&#47536; &#44397;&#48169;&#50948;&#50896;&#54924; &#51204;&#52404;&#54924;&#51032;&#50640;&#49436; &#54620; &#51032;&#50896;&#51032; &#51656;&#47928;&#50640; &#45813;&#54616;&#44256; &#51080;&#45796;.&lt;br&gt;&#44608;&#47749;&#44397; &#49440;&#51076;&#44592;&#51088; daunso@seoul.co.kr"><img src="//img.seoul.co.kr/img/upload/2019/08/05/SSI_20190805171107_V.jpg" border="0" alt="&#51221;&#44221;&#46160; &#44397;&#48169;&#48512; &#51109;&#44288;&#51060; 5&#51068; &#44397;&#54924;&#50640;&#49436; &#50676;&#47536; &#44397;&#48169;&#50948;&#50896;&#54924; &#51204;&#52404;&#54924;&#51032;&#50640;&#49436; &#54620; &#51032;&#50896;&#51032; &#51656;&#47928;&#50640; &#45813;&#54616;&#44256; &#51080;&#45796;. &#44608;&#47749;&#44397; &#49440;&#51076;&#44592;&#51088; daunso@seoul.co.kr"></a>      &#13;
                                  &#13;
                                   &#13;
                                  <img src="//img.seoul.co.kr/img/mexpand.png" title="&#46027;&#48372;&#44592;" alt="&#53364;&#47533;&#54616;&#49884;&#47732; &#50896;&#48376; &#48372;&#44592;&#44032; &#44032;&#45733;&#54633;&#45768;&#45796;."><p class="v_photo_caption">&#9650; &#51221;&#44221;&#46160; &#44397;&#48169;&#48512; &#51109;&#44288;&#51060; 5&#51068; &#44397;&#54924;&#50640;&#49436; &#50676;&#47536; &#44397;&#48169;&#50948;&#50896;&#54924; &#51204;&#52404;&#54924;&#51032;&#50640;&#49436; &#54620; &#51032;&#50896;&#51032; &#51656;&#47928;&#50640; &#45813;&#54616;&#44256; &#51080;&#45796;.<br>&#44608;&#47749;&#44397; &#49440;&#51076;&#44592;&#51088; daunso@seoul.co.kr</p>&#13;
&#13;
&#13;
                        <br><br>&#44397;&#48169;&#48512;&#45716; 14&#51068; &#54693;&#54980; 5&#45380; &#46041;&#50504; &#44400;&#49324;&#47141; &#44148;&#49444;&#44284; &#50868;&#50689; &#44228;&#54925;&#51012; &#45812;&#51008; &#8216;2020~2024 &#44397;&#48169;&#51473;&#44592;&#44228;&#54925;&#8217;&#51012; &#48156;&#54364;&#54664;&#45796;. &#51648;&#45212; 1&#50900; &#48156;&#54364;&#54620; 2019~2023&#45380; &#51473;&#44592;&#44228;&#54925;(270&#51312; 7000&#50613;&#50896;)&#48372;&#45796;&#45716; 19&#51312; 8000&#50613;&#50896;&#51060; &#51613;&#50529;&#46096;&#45796;. &#50864;&#49440; &#48169;&#50948;&#47141; &#44060;&#49440;&#50640; 103&#51312; 8000&#50613;&#50896;&#51012; &#53804;&#51077;&#54620;&#45796;. 2019~2023 &#51473;&#44592;&#44228;&#54925;(94&#51312; 1000&#50613;&#50896;)&#48372;&#45796; 9&#51312; 7000&#50613;&#50896;&#51060; &#45720;&#50632;&#45796;. &#54056;&#53944;&#47532;&#50631;&#44284; &#52384;&#47588;2 &#46321; &#54620;&#44397;&#54805; &#48120;&#49324;&#51068;&#48169;&#50612;(KAMD) &#52404;&#44228;&#51032; &#48169;&#50612; &#51648;&#50669;&#51012; &#54869;&#45824;&#54616;&#44256; &#48120;&#49324;&#51068; &#50836;&#44201; &#45733;&#47141;&#51012; &#45908;&#50865; &#45458;&#51060;&#44256;, &#44400; &#51221;&#52272;&#50948;&#49457;&#51012; &#51204;&#47141;&#54868;&#54616;&#45716; &#46321; &#48513;&#54620;&#51032; &#49888;&#54805; &#53444;&#46020;&#48120;&#49324;&#51068; &#53456;&#51648;&#183;&#50836;&#44201; &#45733;&#47141;&#51012; &#44053;&#54868;&#54620;&#45796;.<br><br>&#45236;&#45380;&#48512;&#53552; F35B &#49688;&#51649; &#51060;&#52265;&#47449; &#49828;&#53588;&#49828; &#51204;&#53804;&#44592;&#47484; &#53457;&#51116;&#54624; &#49688; &#51080;&#45716; &#45796;&#47785;&#51201; &#45824;&#54805;&#49688;&#49569;&#54632;(3&#47564;t &#44221;&#54637;&#47784;&#44553;) &#44060;&#45392;&#49444;&#44228;&#50640; &#52265;&#49688;&#54616;&#44256;, &#50976;&#49324;&#49884; &#48513;&#54620; &#51204;&#47141;&#47581;&#51012; &#47924;&#47141;&#54868;&#54624; 
&#49688; &#51080;&#45716; &#51221;&#51204;&#53444;&#44284; &#51204;&#51088;&#44592;&#54148;&#49828;(EMP)&#53444;&#51012; &#44060;&#48156;&#54620;&#45796;.<br><br>&#51060;&#51452;&#50896; &#44592;&#51088; starjuwon@seoul.co.kr<br><br>&#13;
        &#13;
                &#13;
                   &#13;
                                  &#13;
                &#13;
        &#13;
&#13;
                                     &#13;
                  <p>&#44397;&#48169;&#48512;&#44032; &#48513;&#54620;&#51032; &#53444;&#46020;&#48120;&#49324;&#51068; &#50948;&#54801; &#45824;&#51025;&#44284; &#51109;&#48337; &#48373;&#51648; &#44053;&#54868;&#47484; &#50948;&#54644; 5&#45380;&#44036; 290&#51312;&#50896;&#44032;&#47049;&#51032; &#44397;&#48169;&#48708;&#47484; &#53804;&#51077;&#54616;&#44592;&#47196; &#54664;&#45796;. &#50732;&#54644; 46&#51312; 6000&#50613;&#50896;&#51060;&#45912; &#44397;&#48169; &#50696;&#49328;&#51008; &#45236;&#45380;&#48512;&#53552; &#50672;&#44036; 50&#51312;&#50896;&#51012; &#46028;&#54028;&#54624; &#51204;&#47581;&#51060;&#45796;. 2022&#45380;&#51004;&#47196; &#50696;&#49345;&#46104;&#45716; &#51204;&#49884;&#51089;&#51204;&#53685;&#51228;&#44428; &#51204;&#54872;&#51012; &#54252;&#54632;&#54644; &#50504;&#48372; &#48520;&#50504;&#44048;&#51012; &#54644;&#49548;&#54616;&#47140;&#45716; &#52712;&#51648;&#47196; &#48372;&#51064;&#45796;.&#44397;&#48169;&#48512;&#45716; 14&#51068; &#54693;&#54980; 5&#45380; &#46041;&#50504; &#44400;&#49324;&#47141; &#44148;&#49444;&#44284; &#50868;&#50689; &#44228;&#54925;&#51012; &#45812;&#51008; &#8216;2020~2024 &#44397;&#48169;&#51473;&#44592;&#44228;&#54925;&#8217;&#51012; &#48156;&#54364;&#54664;&#45796;. &#51648;&#45212; 1&#50900; &#48156;&#54364;&#54620; 2019~2023&#45380; &#51473;&#44592;&#44228;&#54925;(270&#51312; 7000&#50613;&#50896;)&#48372;&#45796;&#45716; 19&#51312; 8000&#50613;&#50896;&#51060; &#51613;&#50529;&#46096;&#45796;. &#50864;&#49440; &#48169;&#50948;&#47141; &#44060;&#49440;&#50640; 103&#51312; 8000&#50613;&#50896;&#51012; &#53804;&#51077;&#54620;&#45796;. 2019~2023 &#51473;&#44592;&#44228;&#54925;(94&#51312; 1000&#50613;&#50896;)&#48372;&#45796; 9&#51312; 7000&#50613;&#50896;&#51060; &#45720;&#50632;&#45796;. 
&#54056;&#53944;&#47532;&#50631;&#44284; &#52384;&#47588;2 &#46321; &#54620;&#44397;&#54805; &#48120;&#49324;&#51068;&#48169;&#50612;(KAMD) &#52404;&#44228;&#51032; &#48169;&#50612; &#51648;&#50669;&#51012; &#54869;&#45824;&#54616;&#44256; &#48120;&#49324;&#51068; &#50836;&#44201; &#45733;&#47141;&#51012; &#45908;&#50865; &#45458;&#51060;&#44256;, &#44400; &#51221;&#52272;&#50948;&#49457;&#51012; &#51204;&#47141;&#54868;&#54616;&#45716; &#46321; &#48513;&#54620;&#51032; &#49888;&#54805; &#53444;&#46020;&#48120;&#49324;&#51068; &#53456;&#51648;&#183;&#50836;&#44201; &#45733;&#47141;&#51012; &#44053;&#54868;&#54620;&#45796;.&#45236;&#45380;&#48512;&#53552; F35B &#49688;&#51649; &#51060;&#52265;&#47449; &#49828;&#53588;&#49828; &#51204;&#53804;&#44592;&#47484; &#53457;&#51116;&#54624; &#49688; &#51080;&#45716; &#45796;&#47785;&#51201; &#45824;&#54805;&#49688;&#49569;&#54632;(3&#47564;t &#44221;&#54637;&#47784;&#44553;) &#44060;&#45392;&#49444;&#44228;&#50640; &#52265;&#49688;&#54616;&#44256;, &#50976;&#49324;&#49884; &#48513;&#54620; &#51204;&#47141;&#47581;&#51012; &#47924;&#47141;&#54868;&#54624; &#49688; &#51080;&#45716; &#51221;&#51204;&#53444;&#44284; &#51204;&#51088;&#44592;&#54148;&#49828;(EMP)&#53444;&#51012; &#44060;&#48156;&#54620;&#45796;.&#51060;&#51452;&#50896; &#44592;&#51088; starjuwon@seoul.co.kr</p></div>

[Screenshot: Capture]

mercuree commented 5 years ago

First, you can try applying my fix: https://github.com/codelucas/newspaper/pull/456. If it does not help, then it is probably a long-standing bug. I fixed it in my fork, but I'm not sure the fix is done properly.

JohnChu101 commented 5 years ago

That fix didn't help, so I tried the three steps you provided in #141, and it's solved now. Thanks!

JohnChu101 commented 4 years ago

It seems to be removing the space before and after elements. For example,

<p>We released <a href="https://www.google.com/" target="_blank">a new video</a> here. <a href="https://www.google.com/" target="_blank">Click here to watch it now</a>.</p>

We released a new video here. Click here to watch it now.

will be converted to

<p>We released <a href="https://www.google.com/" target="_blank">a new video</a> here. <a href="https://www.google.com/" target="_blank">Click here to watch it now</a>.</p>

We releaseda new videohere.Click here to watch it now.

The same thing happens with bold text.
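
One plausible way to end up with output like this, as an assumption rather than something confirmed from newspaper's source or the fix in question, is stripping each text node before joining them. The sketch below uses only the standard library html.parser to show the difference; TextCollector and text_chunks are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text between tags, one chunk per text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_chunks(markup):
    """Return the list of text nodes found in markup, in document order."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


snippet = (
    '<p>We released <a href="https://www.google.com/" target="_blank">'
    'a new video</a> here. <a href="https://www.google.com/" target="_blank">'
    'Click here to watch it now</a>.</p>'
)

chunks = text_chunks(snippet)

# Joining the text nodes unchanged keeps the spaces around inline elements.
preserved = "".join(chunks)

# Stripping every node first drops exactly those boundary spaces.
stripped = "".join(chunk.strip() for chunk in chunks)

print(preserved)
print(stripped)
```

If the real cause is similar, the remedy would be to normalize whitespace once on the joined text rather than on each node separately.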