codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

Elastic Search Full Text Search with Rivers Plugin not Indexing URLs that have # (e.g. Angular Partials/Routing) #66

Open fgump2013 opened 9 years ago

fgump2013 commented 9 years ago

Could you please look at this issue I posted on Stack Overflow? It looks like a bug in the Elasticsearch River Web plugin: it doesn't handle AngularJS partial URLs.

http://stackoverflow.com/questions/25878406/elastic-search-full-text-search-with-rivers-plugin-not-indexing-urls-that-have

Full content of the Stack Overflow post:

Using Elasticsearch and the Elasticsearch River Web plugin, I am trying to build a full-text search index for a static website that's built with AngularJS (with partials, routing, etc.).

The sample angular app gist is here: https://gist.github.com/fgump2013/c4d1e97fd8859e2c761d

Here's how I build the index:

PUT _river/angrouting/_meta
{
    "type" : "web",
    "crawl" : {
        "index" : "webindex",
        "url" : ["https://myurl.com/angrouting/"],
        "includeFilter" : ["https://myurl.com/angrouting/."],
        "maxDepth" : 5,
        "maxAccessCount" : 100,
        "numOfThread" : 5,
        "interval" : 1000,
        "target" : [{
            "pattern" : {
                "url" : "https://myurl.com/angrouting/.",
                "mimeType" : "text/html"
            },
            "properties" : {
                "title" : { "text" : "title" },
                "body" : { "text" : "body", "trimSpaces" : true }
            }
        }]
    }
}

Running the text search on "hello", i.e. content from index.html renders fine:

POST /webindex/_search
{
    "query": {
        "query_string": {
            "query": "hello"
        }
    }
}

Output:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.047945753,
        "hits": [
            {
                "_index": "webindex",
                "_type": "angrouting",
                "_id": "JEMExhnfQx29GLAngsKS1w",
                "_score": 0.047945753,
                "_source": {
                    "method": "GET",
                    "lastModified": "2014-09-16T18:54:25.000Z",
                    "url": "https://myurl.com/angrouting/",
                    "contentLength": 685,
                    "httpStatusCode": 200,
                    "charSet": "UTF-8",
                    "mimeType": "text/html",
                    "executionTime": 315,
                    "title": "AngularJS Simple Routing",
                    "body": "Header 1 Hello World!! Sub Page 1 Sub Page 2",
                    "@timestamp": "2014-09-16T19:12:12.194Z"
                }
            }
        ]
    }
}

However, searching for anything in the subpages doesn't seem to work. The Elasticsearch River Web plugin doesn't seem to be indexing the subpages that are routed with Angular.

POST /webindex/_search
{
    "query": {
        "query_string": {
            "query": "Content2"
        }
    }
}

Has anyone else faced the same issue when building a full-text search solution with Elasticsearch? If I use absolute URLs in index.html, the indexer picks them up, but I can't use absolute URLs, since it's an Angular app and I want to use the partials.

Any help is greatly appreciated!

Thanks!

marevol commented 9 years ago

At the moment, River Web does not execute JavaScript while crawling.
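
To illustrate why this matters for URLs containing #, here is a minimal Python sketch. It is not River Web code; the `#/page1` and `#/page2` routes and the `fetchable_urls` helper are hypothetical, chosen only for the example. The part of a URL after # is never sent to the server, so a crawler that does not run JavaScript ends up requesting the same document for every Angular route:

```python
from urllib.parse import urldefrag

# Hypothetical links as they might appear in an AngularJS index.html.
# The "#/page1" and "#/page2" routes are assumptions for illustration.
links = [
    "https://myurl.com/angrouting/",
    "https://myurl.com/angrouting/#/page1",
    "https://myurl.com/angrouting/#/page2",
]


def fetchable_urls(urls):
    """Return the distinct URLs an HTTP client would actually request."""
    seen = []
    for url in urls:
        # urldefrag splits off the fragment, which stays in the client.
        base, _fragment = urldefrag(url)
        if base not in seen:
            seen.append(base)
    return seen


print(fetchable_urls(links))
```

All three links collapse to the single URL `https://myurl.com/angrouting/`, which matches the one hit in the search output above. The subpage content is only inserted by Angular in the browser, so it never reaches the index.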