fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Finished crawling with no results #175

Open tobiasstrauss opened 4 years ago

tobiasstrauss commented 4 years ago

Mandatory

Related issues:

Describe your question
The given CLI example returns no pages from zeit.de, and I have the same problem with other websites. No error is thrown; the crawler just returns and claims to be finished. So the question is whether there is a way to approach the problem. I have attached the log file: log.txt

Versions (please complete the following information):

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

I train language models and fine-tune them on other tasks like NER or text classification.

fhamborg commented 4 years ago

Strange, especially since there's no error in the log! If you use library mode (see readme.md) instead of CLI mode, does the extraction work for you?
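
For reference, a minimal library-mode call as described in the readme looks roughly like this (the zeit.de URL is just an example):

from newsplease import NewsPlease

# Fetch and extract a single article; from_url returns a NewsArticle object
# whose attributes include title and maintext.
article = NewsPlease.from_url('https://www.zeit.de/index')
print(article.title)
print(article.maintext)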

tobiasstrauss commented 4 years ago

Actually, no. The problem seems to be that one has to accept the advertisement pop-up first. The output was the text of the consent page: "zeit.de mit Werbung. Besuchen Sie zeit.de wie gewohnt mit Werbung und Tracking. Details zum Tracking finden Sie in der Datenschutzerklärung und im Privacy Center." (Roughly: "zeit.de with ads. Visit zeit.de as usual, with ads and tracking. Details on tracking can be found in the privacy policy and the Privacy Center.") :-/

fhamborg commented 3 years ago

Did I understand you correctly that:

1) when using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?

2) and, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?

JermellB commented 3 years ago

I had this problem myself; I'm fairly sure I had a configuration issue that was failing silently. I remade my configuration file based on the examples, and things started working. My assumption was some odd tabs-versus-spaces problem in the config.

tobiasstrauss commented 3 years ago

Did I understand you correctly that:

  1. when using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?
  2. and, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?

To 1: exactly! I just asked for maintext and title.
To 2: in CLI mode there is not even a folder referring to zeit.de.

Meanwhile I have set up a new system with Ubuntu 20.04. Same problem, even with a new configuration; I just used the configuration given in the example. This is strange behavior, since other pages like faz seem to work perfectly.

@fhamborg thanks for sharing this great tool. Although zeit.de is not working for me, I was able to crawl many other pages.

edit: my config file:

{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      # zeit.de has a blog which we do not want to crawl
      "url": "http://www.zeit.de",

      "overwrite_heuristics": {
        # because we do not want to crawl that blog, disable all downloads from
        # subdomains
        "is_not_from_subdomain": true
      },
      # Update the condition as well, all the other heuristics are enabled in
      # newscrawler.cfg
      "pass_heuristics_condition": "is_not_from_subdomain and og_type and self_linked_headlines and linked_headlines"
    }
  ]
}

peterkabz commented 3 years ago

@tobiasstrauss I agree with you; the issue is the website's consent pop-up at https://www.zeit.de/zustimmung?url=https%3A%2F%2Fwww.zeit.de%2Findex

woxxel commented 3 years ago

Hey there @tobiasstrauss, you can bypass the issue by sending the appropriate cookie with the crawl request (the cookie is named 'zonconsent'; you would have to get its value by visiting the site manually once). I've been implementing a couple of changes, including this one, which I could push, though I'm not 100% sure whether there are any legal implications to programmatically bypassing such consent pop-ups. Is anyone more knowledgeable about the relevant legal issues?
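
A rough sketch of the approach woxxel describes, assuming the 'zonconsent' value has been copied from a browser session in which the pop-up was already accepted (the value below is a placeholder). Since from_url does not expose a cookie parameter, the page is fetched with requests and the raw HTML is handed to NewsPlease.from_html:

import requests
from newsplease import NewsPlease

url = 'https://www.zeit.de/index'
# Placeholder: copy the real 'zonconsent' cookie value from your browser.
cookies = {'zonconsent': '<value-from-your-browser>'}

# Fetch the page ourselves so the consent cookie is sent along,
# then hand the raw HTML to news-please for extraction.
response = requests.get(url, cookies=cookies)
article = NewsPlease.from_html(response.text, url=url)
print(article.title)
print(article.maintext)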

SamuelHelspr commented 2 years ago

Hey @woxxel, I am currently experiencing the same issues as @tobiasstrauss. Could you share your approach on how to send the cookie with the crawl request? I tried to implement it myself but failed so far. Thanks a lot!

loughnane commented 1 year ago

@SamuelHelspr or @woxxel have either of you (or anyone reading) figured out how to send a cookie? I've been using the from_url function and it seems there's no option to pass it.

JermellB commented 1 year ago

If no one has figured this out in a week, ping me and I'll write a quick patch for you. I was doing some decently large-scale crawls with this, and at that scale this was something I had to do.

loughnane commented 1 year ago

Hey @JermellB, I'd gladly take you up on that patch.

BilalReffas commented 11 months ago

I just had the same experience. Interestingly, some sites (Guardian, FAZ) work fine even though they show ads in between.

But for Spiegel, the maintext is not returned at all for most content.

@JermellB any updates from you? Do you need help getting started on this patch?

fhamborg commented 1 month ago

cf. https://github.com/fhamborg/news-please/pull/282