RichJackson/cogstack
Database - Elasticsearch realtime mapping. With NLP goodiness.
Apache License 2.0
Improve Tika for scanned documents in PDF format
#18
Closed
hkkenneth closed 7 years ago
hkkenneth commented 8 years ago
Use tika-config.xml explicitly
Store original text parsed from PDF (before OCR) in metadata
Change the ImageMagick command (which increases the successful parsing rate)
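
The second item could look roughly like the sketch below. It is not the project's code: the metadata keys, the `min_chars` threshold, and the `run_ocr` callable are all hypothetical names chosen for illustration, and the actual change would live in the Java/Tika parser. The idea is only that the text embedded in the PDF is recorded in metadata before any OCR result replaces it.

```python
from typing import Callable, Dict, Tuple

# Hypothetical metadata keys; the issue does not name the real fields.
ORIGINAL_TEXT_KEY = "original_text_before_ocr"
OCR_APPLIED_KEY = "ocr_applied"


def parse_with_ocr_fallback(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 100,
) -> Tuple[str, Dict[str, str]]:
    """Return (text, metadata), always keeping the pre-OCR text in metadata.

    embedded_text: text extracted directly from the PDF, before OCR.
    run_ocr: callable that performs OCR and returns the recognised text
             (a stand-in for the real OCR step).
    min_chars: assumed threshold below which the PDF is treated as a scan.
    """
    # Store the original text first, so it survives whichever branch runs.
    metadata = {ORIGINAL_TEXT_KEY: embedded_text}

    if len(embedded_text.strip()) >= min_chars:
        # Enough embedded text: skip OCR and use it as-is.
        metadata[OCR_APPLIED_KEY] = "false"
        return embedded_text, metadata

    # Little or no embedded text: assume a scanned document and run OCR.
    metadata[OCR_APPLIED_KEY] = "true"
    return run_ocr(), metadata
```

Keeping the pre-OCR text alongside the OCR output makes it possible to compare the two later, for example to check whether OCR was triggered on a document that already had usable text.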