codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

Problem with news.yahoo.com #118

Closed (rdrgporto closed this issue 8 years ago)

rdrgporto commented 8 years ago

Hi,

I have a problem crawling the news.yahoo.com example. I use a proxy with authentication, and the same setup works with the examples http://www.codelibs.org/ and http://fess.codelibs.org/. I think it may be an issue with HTTPS. Does this crawler support it?

Here is my configuration so you can take a look:

curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://news.yahoo.com/"],
    "include_urls" : ["http://news.yahoo.com/.*"],
    "max_depth" : 3,
    "max_access_count" : 100,
    "num_of_thread" : 5,
    "interval" : 1000,
    "proxy" : {
      "host" : "host_proxy",
      "port" : 80
    },
    "authentications" : [
      {
        "scope" : {
          "scheme" : "BASIC"
        },
        "credentials" : {
          "domain" : "domain",
          "username" : "user",
          "password" : "pass"
        }
      }
    ],
    "target" : [
      {
        "pattern" : {
          "url" : "http://news.yahoo.com/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body"
          },
          "bodyAsHtml" : {
            "html" : "body"
          },
          "projects" : {
            "text" : "ul.nav-list li a",
            "isArray" : true
          }
        }
      }
    ]
}'

In the log, I see the following output:

2016-05-13 12:05:59,134 [Crawler-a252db01-4c8c-4b9e-8687-1ea8a3463cc1-1] INFO  Crawling URL: http://news.yahoo.com/
2016-05-13 12:05:59,196 [Crawler-a252db01-4c8c-4b9e-8687-1ea8a3463cc1-1] INFO  Checking URL: http://news.yahoo.com/robots.txt
2016-05-13 12:05:59,837 [Crawler-a252db01-4c8c-4b9e-8687-1ea8a3463cc1-1] INFO  Redirect to URL: http://news.yahoo.com/

Thanks in advance,

Regards

rdrgporto commented 8 years ago

Hi again,

It works with the following configuration:

curl -XPUT 'localhost:9200/.river_web/config/yahoo_site' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["https://es-us.noticias.yahoo.com/"],
    "include_urls" : ["https://es-us.noticias.yahoo.com/.*"],
    "max_depth" : 1,
    "max_access_count" : 10,
    "num_of_thread" : 3,
    "interval" : 3000,
    "user_agent" : "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
    "proxy" : {
      "host" : "host_proxy",
      "port" : 80
    },
    "authentications" : [
      {
        "scope" : {
          "scheme" : "BASIC"
        },
        "credentials" : {
          "domain" : "domain_name",
          "username" : "user",
          "password" : "pass"
        }
      }
    ],
    "target" : [
      {
        "pattern" : {
          "url" : "https://es-us.noticias.yahoo.com/",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body",
            "trimSpaces" : true
          }
        }
      }
    ]
}'

Regards :smile:
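
For anyone comparing the two configurations: the thread does not say which change made the crawl work (the host, the https scheme, the user_agent, and the limits all differ). One thing that can be checked offline is whether a given URL would pass the include_urls patterns. The sketch below assumes the patterns are applied as full-match regular expressions, which is an assumption and not something confirmed here, and the helper name is_included is hypothetical. The https://news.yahoo.com/ URL is only an illustrative redirect target; the log above shows http://news.yahoo.com/.

```python
import re


def is_included(url, include_patterns):
    """Return True if url fully matches at least one include pattern.

    Assumption: the crawler treats each include_urls entry as a regex
    that must match the whole URL. This is a hypothetical helper, not
    part of elasticsearch-river-web.
    """
    return any(re.fullmatch(pattern, url) for pattern in include_patterns)


# include_urls values copied from the two configurations in this thread.
first_config = ["http://news.yahoo.com/.*"]
second_config = ["https://es-us.noticias.yahoo.com/.*"]

# The start URL of the first configuration matches its own pattern.
print(is_included("http://news.yahoo.com/", first_config))

# A hypothetical https redirect target would not match an http-only pattern.
print(is_included("https://news.yahoo.com/", first_config))

# The start URL of the second configuration matches its https pattern.
print(is_included("https://es-us.noticias.yahoo.com/", second_config))
```

If a site redirects from http to https, a pattern written for only one scheme would exclude the redirect target under this assumption, so it may be worth keeping urls, include_urls, and the target pattern on the same scheme.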