dachcom-digital / pimcore-lucene-search

Pimcore Website Indexer (powered by Zend Search Lucene)

Pages get indexed twice #82

Status: Closed (GALCF closed this issue 5 years ago)

GALCF commented 5 years ago
Q                  A
Bug report?        yes
Feature request?   no
BC Break report?   no
RFC?               no

Request to reopen #64

Every search result appears twice, regardless of the configuration. I tried crawling both with and without www. and could not find any redirect errors in the verbose log.
The log file might suggest that the crawler runs twice, but I don't know enough about the crawler to be sure.

My configuration looks like this:

lucene_search:
  enabled: true
  fuzzy_search_results: false
  search_suggestion: true
  seeds:
    - 'http://domain.local'
  filter:
    valid_links:
      - '@^http://domain.local.*@i'
  allowed_schemes:
    - 'http'
  view:
    max_per_page: 10
  crawler:
    content_max_size: 4
    content_start_indicator: '<!-- main-content -->'
    content_end_indicator: '<!-- /main-content -->'
    content_exclude_start_indicator: '<!-- indexer-ignore -->'
    content_exclude_end_indicator: '<!-- /indexer-ignore -->'

My verbose log file:

$ vagrant ssh -c "/home/vagrant/pimcore/bin/console lucenesearch:crawl -f -vvv"

17:24:38 DEBUG     [pimcore] PHP garbage collector collected 0 cycles
17:24:38 DEBUG     [pimcore] LuceneSearch: Stopping crawl
17:24:38 DEBUG     [pimcore] LuceneSearch: Reset Genesis Index
17:24:38 DEBUG     [pimcore] LuceneSearch: Reset Persistence Store
17:24:38 DEBUG     [pimcore] LuceneSearch: Reset Asset Tmp
17:24:38 DEBUG     [pimcore] LuceneSearch: Reset Logs
17:24:38 DEBUG     [pimcore] LuceneSearch: Starting crawl
17:24:39 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.match.invalid.filtered] --REDACTED INFO--
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.match.invalid.filtered] --REDACTED INFO--
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.match.invalid.filtered] --REDACTED INFO--
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.match.invalid.filtered] --REDACTED INFO--
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.match.invalid.filtered] --REDACTED INFO--
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/
17:24:42 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de/kontakt
17:24:43 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/en/contact
17:24:45 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/en/service
17:24:46 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de/service
[...]
17:25:15 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de/shop
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/en
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] enqueued links: 27
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] skipped links: 5
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] failed links: 0
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] persisted links: 27
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] memory peak usage: 34.75MB
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] total time: 00:38
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.crawler] politeness wait time: 0.52 seconds
17:25:17 DEBUG     [php] Warning: DOMDocument::loadHTML(): Document is empty in Entity, line: 3
[
  "exception" => Symfony\Component\Debug\Exception\SilencedErrorContext {#1
    +count: 1
    -severity: Symfony\Component\Debug\Exception\SilencedErrorContext {#1}
    trace: Symfony\Component\Debug\Exception\SilencedErrorContext {#1}
  }
]
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/de
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/de/shop
[...]
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/en
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/en/shop
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.parser] skip indexing [ http://domain.local/ ] because of wrong status code [ 302 ]
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.parser] closed frontend index references
17:25:17 DEBUG     [pimcore] LuceneSearch: [task.parser] optimize lucene index
17:25:18 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local
17:25:19 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de
17:25:20 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/
17:25:21 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de/kontakt
17:25:23 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/en/contact
17:25:24 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/en/service
17:25:25 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de/service
[...]
17:25:38 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/en/shop
17:25:54 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de/shop
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/en
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] enqueued links: 27
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] skipped links: 0
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] failed links: 0
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] persisted links: 27
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] memory peak usage: 34.75MB
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] total time: 00:38
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.crawler] politeness wait time: 0.52 seconds
17:25:56 DEBUG     [php] Notice: Undefined property: LuceneSearchBundle\Task\Parser\ParserTask::$index
[
  "exception" => Symfony\Component\Debug\Exception\SilencedErrorContext {#1
    +count: 1
    -severity: Symfony\Component\Debug\Exception\SilencedErrorContext {#1}
    trace: Symfony\Component\Debug\Exception\SilencedErrorContext {#1}
  }
]
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/de
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/de/kontakt
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/de/shop
[...]
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/en/contact
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/en
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] added html to indexer stack: http://domain.local/en/shop
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] skip indexing [ http://domain.local/ ] because of wrong status code [ 302 ]
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] closed frontend index references
17:25:56 DEBUG     [pimcore] LuceneSearch: [task.parser] optimize lucene index
17:25:56 DEBUG     [pimcore] LuceneSearch: Reset Persistence Store
17:25:57 DEBUG     [pimcore] LuceneSearch: Reset Uri Filter Persistence Store
17:25:57 DEBUG     [pimcore] LuceneSearch: Reset Asset Tmp
17:25:57 DEBUG     [pimcore] LuceneSearch: Remove Queued Document Modifiers
17:25:57 DEBUG     [pimcore] LuceneSearch: Stopping crawl
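One way to check whether a log like the one above really contains a second crawl pass is to count how often each URI is reported as crawled. The following is a minimal sketch in Python, not part of the bundle; it only assumes the `[spider.uri.crawled]` line format visible in the log, and the helper names `crawl_counts` and `duplicates` are hypothetical.

```python
import re
from collections import Counter

# Matches the URI at the end of lines such as:
# 17:24:39 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local
CRAWLED = re.compile(r"\[spider\.uri\.crawled\]\s+(\S+)")


def crawl_counts(log_text: str) -> Counter:
    """Count how many times each URI was reported as crawled."""
    return Counter(match.group(1) for match in CRAWLED.finditer(log_text))


def duplicates(log_text: str) -> dict:
    """Return only the URIs that were crawled more than once."""
    return {uri: n for uri, n in crawl_counts(log_text).items() if n > 1}


# Four lines copied from the two passes in the log above.
sample = """\
17:24:39 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local
17:24:40 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de
17:25:18 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local
17:25:19 DEBUG     [pimcore] LuceneSearch: [task.crawler] [spider.uri.crawled] http://domain.local/de
"""

print(duplicates(sample))
```

If every crawled URI shows up with a count of 2, the whole crawl ran twice, which points at the seed list rather than at individual redirects.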
GALCF commented 5 years ago

I have already solved my issue: the problem was a merge of YAML files that somehow appended the crawl seeds instead of replacing the first one set.

Sorry for the confusion.
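
To illustrate the kind of merge described above, here is a rough Python sketch of a configuration merge that concatenates lists instead of replacing them. It is not the bundle's or the framework's actual merge code; the function `merge_appending`, the two example dictionaries, and the assumption that the second file repeated the same seed are all hypothetical.

```python
def merge_appending(base: dict, override: dict) -> dict:
    """Merge two config dicts: nested dicts merge recursively, lists are concatenated."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_appending(merged[key], value)
        elif isinstance(value, list) and isinstance(merged.get(key), list):
            # Appending behaviour: the override list is added to the base list.
            merged[key] = merged[key] + value
        else:
            # Scalars (and mismatched types) are simply replaced.
            merged[key] = value
    return merged


# Hypothetical contents of two YAML files after parsing.
base_config = {"lucene_search": {"enabled": True, "seeds": ["http://domain.local"]}}
local_config = {"lucene_search": {"seeds": ["http://domain.local"]}}

merged = merge_appending(base_config, local_config)
seeds = merged["lucene_search"]["seeds"]
print(seeds)  # the same seed is now listed twice

# Dropping repeated seeds while keeping their order:
unique_seeds = list(dict.fromkeys(seeds))
print(unique_seeds)
```

With two identical seeds in the effective configuration, a crawler that starts one pass per seed would visit and index every page twice, which matches the two passes in the log.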