deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

add ability to specify custom extraction methods across different file types. #140

Open deanmalmgren opened 7 years ago

deanmalmgren commented 7 years ago

@barrust brought this up in #122

When iterating over a large number of files, it is difficult to specify non-standard method kwargs for different filetypes. For example, currently method is used for PDFs and engine is used for audio files:

for filename in filenames:
    txt = textract.process(filename, method='tesseract', engine='sphinx')
    print(txt)

I personally like the simplicity of always having a method kwarg for the textract.process function, but what if we gave users the ability to check a file's extension before it is processed, so they can easily handle PDFs vs. audio files differently? I'm thinking of something like this:

for filename in filenames:
    ext = textract.get_extension(filename)
    kwargs = {}  # default: no special options for other file types
    if ext == 'pdf':
        kwargs = {'method': 'tesseract'}
    elif ext.is_audio:
        kwargs = {'method': 'sphinx'}
    txt = textract.process(filename, **kwargs)
    print(txt)
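
Until something like get_extension exists, the same dispatch can be approximated outside textract. This is only a minimal sketch: get_extension, kwargs_for, and AUDIO_EXTENSIONS are hypothetical names that are not part of textract, and the set of audio extensions is an assumption.

```python
import os

# Hypothetical set of audio extensions; adjust to the formats you actually have.
AUDIO_EXTENSIONS = {"mp3", "ogg", "wav"}


def get_extension(filename):
    """Return the lowercased extension without the leading dot ('' if none)."""
    return os.path.splitext(filename)[1].lower().lstrip(".")


def kwargs_for(filename):
    """Pick per-filetype keyword arguments to pass on to textract.process."""
    ext = get_extension(filename)
    if ext == "pdf":
        return {"method": "tesseract"}
    if ext in AUDIO_EXTENSIONS:
        return {"method": "sphinx"}
    return {}  # anything else falls back to the defaults
```

With these helpers, the loop body would reduce to textract.process(filename, **kwargs_for(filename)).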

Another approach would be to let the method kwarg also accept a dictionary:

methods = {
    'pdf': 'tesseract',
    'audio': 'sphinx',
}
for filename in filenames:
    txt = textract.process(filename, method=methods)
    print(txt)
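
One way a dictionary-valued method might be resolved for each file is sketched below. It is an illustration, not textract's implementation: resolve_method and GROUPS are hypothetical, and the grouping of extensions under "audio" is an assumption.

```python
import os

# Hypothetical grouping of extensions under the keys used in the methods dict.
GROUPS = {"pdf": {"pdf"}, "audio": {"mp3", "ogg", "wav"}}


def resolve_method(filename, method):
    """Return the method for this file, or None to use the parser's default."""
    if not isinstance(method, dict):
        return method  # a plain string (or None) keeps the current behavior
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext in method:
        return method[ext]  # an exact extension key takes precedence
    for group, extensions in GROUPS.items():
        if ext in extensions and group in method:
            return method[group]
    return None
```

Accepting both strings and dictionaries would keep existing calls such as method='tesseract' working unchanged.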

deanmalmgren commented 7 years ago

Another issue here is thinking about how we will deal with this on the command line...

barrust commented 7 years ago

What about making it a comma-delimited list? Each function with multiple extraction methods would have to handle splitting it, but it could solve the CLI issue.

for filename in filenames:
    txt = textract.process(filename, method='tesseract,sphinx')
    print(txt)

I would assume that the CLI would require that there be no spaces between different methods.

I also like the other proposed ideas, but I'm not sure how they would work with the CLI. The idea of getting back extensions would be great!

deanmalmgren commented 7 years ago

Below are a couple of ideas for the CLI. There are probably others (please share if you have ideas!), but I have a slight preference for the configuration file approach. It lets users specify many kwargs for textract.process at once. Then, instead of overloading the method kwarg to textract.process, we can use a configuration object to override defaults.

Thoughts and feedback welcome! This is definitely a major change to the UI and would warrant a major version bump to 2.0.0 so I want to make sure we get this right.

hyphenated command line arg

for f in directory/*; do
    textract --method-pdf tesseract --method-audio sphinx "$f"
done

colon-ized command line value

for f in directory/*; do
    textract --method pdf:tesseract --method audio:sphinx "$f"
done
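
For illustration, the colon-ized form could be parsed with the standard library's argparse roughly as follows. This is a sketch of the proposal only: the repeated FILETYPE:METHOD form of --method is not an existing textract option, and parse_method and build_parser are hypothetical names.

```python
import argparse


def parse_method(value):
    """Turn 'pdf:tesseract' into ('pdf', 'tesseract')."""
    filetype, sep, method = value.partition(":")
    if not sep or not filetype or not method:
        raise argparse.ArgumentTypeError(
            "expected FILETYPE:METHOD, got %r" % value)
    return filetype, method


def build_parser():
    parser = argparse.ArgumentParser(prog="textract")
    # action="append" collects every --method occurrence into one list.
    parser.add_argument("--method", action="append", type=parse_method,
                        default=[], metavar="FILETYPE:METHOD")
    parser.add_argument("filename")
    return parser


args = build_parser().parse_args(
    ["--method", "pdf:tesseract", "--method", "audio:sphinx", "example.pdf"])
methods = dict(args.method)  # mapping of file type to method
```

The resulting methods mapping has the same shape as the dictionary proposed for the Python API, so both interfaces could share one code path.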

json command line value

for f in directory/*; do
    textract --method '{"pdf":"tesseract","audio":"sphinx"}' "$f"
done

conf file

for f in directory/*; do
    textract --conf textract.conf "$f"
done

barrust commented 7 years ago

I like the configuration file option better than having to form valid JSON! The other two options, colons or hyphens, are both good, but I think the configuration file will likely be more future-proof.

deanmalmgren commented 7 years ago

Thanks for the input @barrust

INI format? YAML?

I think I have a small preference for YAML but I welcome arguments from others.

If I have time, I may try to mock this up on my flight back to Chicago on Friday. Sounds fun :)

deanmalmgren commented 7 years ago

I'm leaning toward INI format for this so people can set it in their project's setup.cfg. I also think we could probably address #96 at the same time, which would be very nice.
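
As a rough sketch of the INI idea, the standard library's configparser can already read such a mapping. The [textract] section name and its pdf/audio keys are assumptions made for illustration, not a format textract defines, and load_methods is a hypothetical helper.

```python
import configparser

# Hypothetical layout: a [textract] section mapping file types to methods.
EXAMPLE_CFG = """
[textract]
pdf = tesseract
audio = sphinx
"""


def load_methods(text, section="textract"):
    """Return {filetype: method} from INI text, or {} if the section is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(section):
        return {}
    return dict(parser.items(section))


methods = load_methods(EXAMPLE_CFG)
```

For a real project, parser.read("setup.cfg") would replace read_string, and a missing file or section would simply leave the defaults in place.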

filipopo commented 4 years ago

It's been a long time, but this works for now if anyone is still interested and following this issue:

from os.path import splitext
from textract import process

switcher = {
    "pdf": "pdfminer",
    "mp3": "SpeechRecognition"
}

filenames = ["hoho.txt", "asdf.pdf", "example.mp3"]
for filename in filenames:
    ext = splitext(filename)[1][1:]
    method = switcher.get(ext, "")
    text = process(filename, method=method)