bejean / crawl-anywhere

Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
www.crawl-anywhere.com
Apache License 2.0
96 stars 38 forks

add DjVu support #7

Open ghost opened 11 years ago

ghost commented 11 years ago

Can you please add DjVu indexing support? There is a tool similar to pdftotext for DjVu files: http://djvu.sourceforge.net/doc/man/djvutxt.html I like crawl anywhere, because it is super fast. Sadly, I am not able to add DjVu support myself, as I do not understand Java.
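
For illustration only, here is a minimal sketch of how a Java program could call djvutxt and capture the extracted text. It assumes the DjVuLibre `djvutxt` binary is installed and on the PATH, and that it writes the text to standard output when no output file is given. The class and method names (`DjvuTextExtractor`, `buildCommand`, `extractText`) are hypothetical and are not part of Crawl-Anywhere.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Crawl-Anywhere.
public class DjvuTextExtractor {

    // Builds the command line. Assumption: "djvutxt" is on the PATH.
    static List<String> buildCommand(Path djvuFile) {
        return List.of("djvutxt", djvuFile.toString());
    }

    // Runs djvutxt and returns whatever it printed to standard output.
    static String extractText(Path djvuFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(djvuFile))
                // Discard stderr so an unread error stream cannot block the child process.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("djvutxt exited with code " + exitCode);
        }
        // Assumption: the extracted text is UTF-8 encoded.
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DjvuTextExtractor <file.djvu>");
            return;
        }
        System.out.print(extractText(Path.of(args[0])));
    }
}
```

A file without a hidden text layer would be expected to yield little or no output, so a caller would probably want to treat an empty result as "nothing to index" rather than as an error.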

bejean commented 11 years ago

Thank you for the "I like crawl anywhere, because it is super fast" :) Can you provide some URLs of DjVu files whose text can be extracted with djvutxt?

bejean commented 11 years ago

It is not easy to find good files for testing. The best option would be for you to send me some DjVu files from which the djvutxt utility produces text.

ghost commented 11 years ago

Thanks for your fast response. Here is a DjVu file with a hidden text layer that can be extracted: http://www.djvuzone.org/support/results.djvu I can look for better examples if this is not a good file for testing.

bejean commented 11 years ago

Yes please, I need some good files for testing.

ghost commented 11 years ago

Did you get my email with additional links to DjVu files?

bejean commented 11 years ago

Yes, thank you.