deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

.djvu support #34

Open vi opened 10 years ago

vi commented 10 years ago

There can be text information embedded in djvu documents.

deanmalmgren commented 10 years ago

Great idea; know of any utilities that parse .djvu?

vi commented 10 years ago

Maybe djvutxt?

DJVUTXT(1)    DjVuLibre-3.5    DJVUTXT(1)

NAME
       djvutxt - Extract the hidden text from DjVu documents.

SYNOPSIS
       djvutxt [options] inputdjvufile [outputtxtfile]

If a .djvu file has no text chunk, the same OCR approach used for other images can be applied.
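For illustration, the idea could be sketched roughly like this in Python. It relies only on the synopsis above (with no output file argument, `djvutxt` prints the text to stdout). The names `extract_djvu_text` and `needs_ocr`, and the injectable `run` parameter, are made up for this sketch and are not part of textract.

```python
import subprocess


def extract_djvu_text(path, run=subprocess.run):
    """Return the hidden text layer of a DjVu file as a string.

    `run` is injectable so the function can be exercised without
    djvutxt installed (an assumption made for this sketch).
    """
    # No output file is given, so djvutxt writes the text to stdout.
    result = run(["djvutxt", path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(text):
    """Treat an empty or whitespace-only text layer as a cue to fall back to OCR."""
    return not text.strip()
```

A caller would try `extract_djvu_text` first and hand the file to the existing image OCR path only when `needs_ocr` returns `True`.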

deanmalmgren commented 10 years ago

Sounds great. Want to work up a pull request? The doc_parser.py module should be a good starting point for this functionality.

deanmalmgren commented 10 years ago

I'm not sure whether you've started on this yet, but I wanted to mention that I merged #39, which switches to a class-based set of parsers instead of the function-based approach that existed in v0.5.1. Have a look at the current parsers, especially textract.parsers.doc_parser, and let me know if you have any questions!
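As a rough sketch of what a class-based .djvu parser might look like: the `ShellParser` base class below is a hypothetical stand-in written for this example, and the real base class and method names in textract may differ. It assumes only that a parser exposes an `extract(filename)` method and shells out to `djvutxt`.

```python
import subprocess


class ShellParser:
    """Hypothetical stand-in for a base class that runs shell commands."""

    def run(self, args):
        # Run an external command and return its (stdout, stderr) as bytes.
        proc = subprocess.run(args, capture_output=True, check=True)
        return proc.stdout, proc.stderr


class Parser(ShellParser):
    """Extract the hidden text layer from .djvu files using djvutxt."""

    def extract(self, filename, **kwargs):
        # djvutxt prints the text to stdout when no output file is given.
        stdout, _stderr = self.run(["djvutxt", filename])
        return stdout
```

Keeping the command invocation in `run` means a test can override it and check the arguments without needing DjVuLibre installed.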

vi commented 10 years ago

Not started, just keeping this tab open. Feel free to start developing this feature yourself.

I'll write here if/when I decide to try implementing a patch for this.

(actually I haven't cloned "textract" yet)

deanmalmgren commented 10 years ago

OK, sounds good. I've never come across .djvu files myself, so I probably won't develop it anytime soon. It's a great idea, though.