feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License

Spiegel.de/einestages #64

Closed (radlerandi closed this issue 6 years ago)

radlerandi commented 7 years ago

For some time now, various blocks of JavaScript code have been creeping into the text of spiegel.de/einestages articles. It didn't bother me until I noticed just now that the text broke off after one of them, while a good deal more was still visible on the website. Example: http://www.spiegel.de/einestages/militaerputsch-in-griechenland-1967-als-die-junta-die-demokratie-toetete-a-1143692.html#ref=rss After the quote block whose last sentence is "Kritik am Regime wurde durch Pressezensur erstickt.", some JavaScript follows up to '+c.msg+' and then the article ends. The same thing happens when I remove all cleanups in FeedIron testing and fetch only "xpath": "div[@id='content-main']": it breaks off at the same place, presumably because two DIVs are closed directly after it. Does anyone have an idea how to fix this? The cleanup "script" had no effect either.

dugite-code commented 6 years ago

Translation of the report: For some time now, various blocks of JavaScript code have been sneaking into the text at spiegel.de/einestages. It did not bother me until I realized that the text broke off after one of them, although much more could be seen on the website. Example: http://www.spiegel.de/einestages/militaerputsch-in-griechenland-1967-als-die-junta-die-demokratie-toetete-a-1143692.html#ref=rss After the quote block with the last sentence "Criticism of the regime was stifled by press censorship.", some JavaScript follows up to '+c.msg+' and then the text ends. The same happens when I take out all cleanups in FeedIron testing and fetch only "xpath": "div[@id='content-main']": it breaks off in the same place, probably because two DIVs are closed directly behind it. Does anyone have an idea how to correct that? The cleanup "script" has no effect either.

You can select the script contents using "script[contains(@data-aid,'dialog-js')]\/text()"

I also had to add "a[contains(@class,'social')]"; otherwise the whole article would become a link, caused by an issue similar to #22.

{
    "type": "xpath",
    "xpath": "div[@id='content-main']",
    "cleanup": [
        "a[contains(@class,'social')]",
        "script[contains(@data-aid,'dialog-js')]\/text()",
        "div[@id='footer']",
        "div[@id='spVeeseo']",
        "div[@id='js-article-comments-box-form']",
        "div[@class='spTopicMultimedia']",
        "div[@class='article-copyright']",
        "div[@class='home-link-box']",
        "div[@class='module-title']",
        "div[@class='module-subtitle']",
        "div[@class='blog-pager']",
        "div[@class='top-anchor']",
        "div[contains(@class, 'asset-box')]",
        "div[contains(@class, 'function-box')]",
        "div[contains(@class, 'article-newsfeed-box')]",
        "div[contains(@class, 'js-article-post-teaser')]",
        "div[contains(@class, 'social-media-box')]",
        "div[contains(@class, 'top-anchor')]",
        "div[contains(@class, 'home-link-box')]",
        "ul[@class='article-function-box']"
    ]
}

radlerandi commented 6 years ago

Thanks! Your two lines improve "Spiegel/einestages" for me from nothing at the moment to about a third of the page. Even with all your cleanups it now stops before the line "Eine heikle Stellungnahme". Do you get more text? Edit: I figured it out, it is the same JavaScript problem again. "script[contains(@data-aid,'dialog-js')]\/text()" filters out the first occurrence of <script data-aid="html_127371_dialog-js" type="text/javascript"> but not the second.

dugite-code commented 6 years ago

OK, I've done some more investigation.

It turns out this page has some unclosed tags. I added HTML Tidy to get_content and that fixed all the issues. I'll add a "tidy-source":true option.

dugite-code commented 6 years ago

I've added the "tidy-source":true option to dev and am currently testing for any gotchas. I'll merge it into master in a day or so: https://github.com/feediron/ttrss_plugin-feediron/commit/cb72535c0d1d85582d4ad9e9b610cc2659b601e2
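
For reference, a minimal sketch of how the rule from earlier in this thread might look with the new option enabled. It assumes that "tidy-source" sits at the same level as "type" and "xpath" (the thread does not show where the key goes). The cleanup list is shortened here to the two rules added above plus one of the div rules, so the remaining entries would still need to be carried over.

```json
{
    "type": "xpath",
    "xpath": "div[@id='content-main']",
    "tidy-source": true,
    "cleanup": [
        "a[contains(@class,'social')]",
        "script[contains(@data-aid,'dialog-js')]\/text()",
        "div[@id='footer']"
    ]
}
```

Whether the script cleanup is still needed once the source has been tidied is not confirmed in the thread, so it is probably worth checking with and without it in the FeedIron testing tab.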