rdrgporto closed this issue 8 years ago
Hi again,
It works with the following configuration:
curl -XPUT 'localhost:9200/.river_web/config/yahoo_site' -d '{
  "index" : "webindex",
  "type" : "my_web",
  "urls" : ["https://es-us.noticias.yahoo.com/"],
  "include_urls" : ["https://es-us.noticias.yahoo.com/.*"],
  "max_depth" : 1,
  "max_access_count" : 10,
  "num_of_thread" : 3,
  "interval" : 3000,
  "user_agent" : "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
  "proxy" : {
    "host" : "host_proxy",
    "port" : 80
  },
  "authentications" : [
    {
      "scope" : {
        "scheme" : "BASIC"
      },
      "credentials" : {
        "domain" : "domain_name",
        "username" : "user",
        "password" : "pass"
      }
    }
  ],
  "target" : [
    {
      "pattern" : {
        "url" : "https://es-us.noticias.yahoo.com/",
        "mimeType" : "text/html"
      },
      "properties" : {
        "title" : {
          "text" : "title"
        },
        "body" : {
          "text" : "body",
          "trimSpaces" : true
        }
      }
    }
  ]
}'
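In case it helps anyone who would rather not hand-quote JSON inside a shell command, here is a minimal Python sketch that builds the same document and prepares the equivalent PUT request. The helper names `build_crawl_config` and `build_put_request` are hypothetical (they are not part of the plugin), and the `http://localhost:9200` base URL is an assumption taken from the curl command above. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_crawl_config(start_url: str, proxy_host: str, proxy_port: int) -> dict:
    """Hypothetical helper: mirror the keys and values of the curl payload above."""
    return {
        "index": "webindex",
        "type": "my_web",
        "urls": [start_url],
        # Same pattern as in the curl payload: the start URL followed by ".*"
        "include_urls": [start_url + ".*"],
        "max_depth": 1,
        "max_access_count": 10,
        "num_of_thread": 3,
        "interval": 3000,
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
        "proxy": {"host": proxy_host, "port": proxy_port},
        "authentications": [
            {
                "scope": {"scheme": "BASIC"},
                # Placeholder credentials, copied from the post
                "credentials": {
                    "domain": "domain_name",
                    "username": "user",
                    "password": "pass",
                },
            }
        ],
        "target": [
            {
                "pattern": {"url": start_url, "mimeType": "text/html"},
                "properties": {
                    "title": {"text": "title"},
                    "body": {"text": "body", "trimSpaces": True},
                },
            }
        ],
    }


def build_put_request(config: dict, name: str,
                      base: str = "http://localhost:9200") -> urllib.request.Request:
    """Hypothetical helper: prepare (but do not send) the PUT that curl would issue."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base}/.river_web/config/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


config = build_crawl_config("https://es-us.noticias.yahoo.com/", "host_proxy", 80)
request = build_put_request(config, "yahoo_site")
# Passing `request` to urllib.request.urlopen would send it; that call is left
# out so the sketch runs without an Elasticsearch node.
```

Serializing with `json.dumps` means a missing comma or quote shows up as a Python error before anything reaches the server.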
Regards :smile:
Hi,
I have a problem with the news.yahoo.com example. I use a proxy and authentication, and crawling works with the http://www.codelibs.org/ and http://fess.codelibs.org/ examples, so I think it may be an issue with HTTPS. Does this crawler support HTTPS?
Here is my configuration so you can take a look:
In the log, I see the following error:
Thanks in advance,
Regards