codelibs / fess

Fess is a very powerful and easily deployable Enterprise Search Server.
https://fess.codelibs.org
Apache License 2.0

Unable to index PDF, Word or Excel files #2164

Closed (closed by erbouchard 4 years ago)

erbouchard commented 5 years ago

I'm trying to figure out why my HTML pages are indexed properly but none of the PDF, Word or Excel files are.

Using version 12.6.

Crawling parameters

URLs
http://host/NPG/

Included URL For Crawling
http://host/NPG/.*

Excluded URLs For Crawling
.(?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4)

Included URLs For Indexing
http://host/NPG/.*

Excluded URLs For Indexing
(empty)
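
One way to rule out the URL filters is to test the patterns above against a few sample URLs. The sketch below is not Fess code. It assumes a pattern must match the whole URL, which may differ from how Fess actually applies these settings, and the file names are hypothetical. Python only accepts an inline `(?i)` at the start of a pattern, so the exclude pattern drops it and passes `re.IGNORECASE` instead.

```python
import re

# Include pattern copied from the crawl settings above.
INCLUDE = [r"http://host/NPG/.*"]
# Original exclude pattern: .(?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4)
# The inline (?i) is removed here and replaced by the re.IGNORECASE flag below.
EXCLUDE = [r"..*(css|js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4)"]


def should_crawl(url: str) -> bool:
    """Return True if url matches an include pattern and no exclude pattern.

    Assumption: each pattern has to match the entire URL (full match).
    """
    included = any(re.fullmatch(p, url) for p in INCLUDE)
    excluded = any(re.fullmatch(p, url, re.IGNORECASE) for p in EXCLUDE)
    return included and not excluded


# Hypothetical file names under the crawl root.
for url in [
    "http://host/NPG/index.html",
    "http://host/NPG/report.pdf",
    "http://host/NPG/budget.xlsx",
    "http://host/NPG/notes.docx",
    "http://host/NPG/logo.PNG",
]:
    print(url, should_crawl(url))
```

Under that assumption only the image URL is rejected, so these include and exclude patterns alone would not explain the missing PDF, Word and Excel documents.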

Questions

  1. Is this supposed to index those documents (references in <a href="...">...</a>) by default?

  2. Or do I have to configure something?

Thanks

marevol commented 5 years ago

Is this supposed to index those documents (references in <a href="...">...</a>) by default?

Yes.

Or do I have to configure something?

See fess-crawler.log.

erbouchard commented 5 years ago

Here's my fess-crawler.log for today. No trace of any of those files.

fess-crawler.log
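
To double-check that the log really has no trace of those files, a small script can list every line that mentions a PDF, Word or Excel extension. This is only a sketch: it makes no assumption about the Fess log format and simply searches each line as plain text. The helper names and the extension list are illustrative.

```python
import re
from typing import Iterable, List

# Extensions of the document types missing from the index (illustrative list).
DOC_EXT = re.compile(r"\.(?:pdf|docx?|xlsx?)\b", re.IGNORECASE)


def lines_mentioning_documents(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain something ending in a document extension."""
    return [line.rstrip("\n") for line in lines if DOC_EXT.search(line)]


def scan_file(path: str) -> List[str]:
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return lines_mentioning_documents(log)


# Example call (the path is an assumption about where the log is stored):
# for hit in scan_file("fess-crawler.log"):
#     print(hit)
```

An empty result would suggest the crawler never attempted to fetch the documents at all, which points to link discovery rather than to text extraction.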

marevol commented 5 years ago

Which page links to the PDF file?
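
To answer that, it can help to list the document links that a plain HTML parser finds on a given page. The sketch below uses only the Python standard library and is not how Fess extracts links. The sample markup, the `openPdf()` handler and the extension list are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

DOC_SUFFIXES = (".pdf", ".doc", ".docx", ".xls", ".xlsx")


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def document_links(html: str, base_url: str) -> List[str]:
    """Return the links on the page that end in a document extension."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if u.lower().endswith(DOC_SUFFIXES)]


# Hypothetical page content.
page = (
    '<a href="docs/report.pdf">Report</a> '
    '<a href="about.html">About</a> '
    '<button onclick="openPdf()">Open</button>'
)
print(document_links(page, "http://host/NPG/"))
# prints ['http://host/NPG/docs/report.pdf']
```

A document that is opened only through a script handler, like the button in the sample, does not show up in this scan, so it is worth checking how the pages under http://host/NPG/ actually reference their files.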