Open deanmalmgren opened 7 years ago
Another issue here is thinking about how we will deal with this on the command line...
What about making it a comma delimited list? Each function with multiple extraction methods would have to handle splitting it but it could solve the CLI issue.
for filename in filenames:
    txt = textract.process(filename, method='tesseract,sphinx')
    print(txt)
I would assume that the CLI would require that there be no spaces between different methods.
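To make the splitting step concrete, here is a minimal sketch of what each multi-method function might do with a comma-delimited value. The helper name split_methods is hypothetical and is not part of textract; it only illustrates the parsing.

```python
def split_methods(method):
    """Split a comma-delimited method string into a list of method names.

    Hypothetical helper for illustration; not part of textract.
    """
    # Tolerate stray whitespace and empty entries, e.g. "tesseract, sphinx,".
    return [name.strip() for name in method.split(",") if name.strip()]


methods = split_methods("tesseract,sphinx")
print(methods)
```

Each parser would still have to decide which of the returned names applies to it, which is the part this approach leaves open.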
I also like the other proposed ideas but not sure how they would work with the CLI. The idea of getting back extensions would be great!
Below are a couple of ideas for the CLI. There are probably others (please share if you have ideas!), but I think I have a slight preference for the configuration file approach. It gives textract the option to specify lots of kwargs at the same time to textract.process. Then, instead of overloading the method kwarg to textract.process, we can use a configuration object to override defaults.
Thoughts and feedback welcome! This is definitely a major change to the UI and would warrant a major version bump to 2.0.0, so I want to make sure we get this right.
for f in directory/*; do
    textract --method-pdf tesseract --method-audio sphinx "$f"
done
for f in directory/*; do
    textract --method pdf:tesseract --method audio:sphinx "$f"
done
for f in directory/*; do
    textract --method '{"pdf":"tesseract","audio":"sphinx"}' "$f"
done
for f in directory/*; do
    textract --conf textract.conf "$f"
done
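For the colon-separated variant, a rough sketch of how repeated --method TYPE:NAME flags could be collected into a dictionary with argparse. This is only an illustration of the proposal: the flag does not exist in textract's CLI in this form, and parse_method_overrides is a hypothetical name.

```python
import argparse


def parse_method_overrides(argv):
    """Parse repeated --method TYPE:NAME flags into a {type: name} dict.

    Hypothetical sketch of the proposed CLI; not textract's actual parser.
    """
    parser = argparse.ArgumentParser(prog="textract")
    parser.add_argument("--method", action="append", default=[],
                        metavar="TYPE:NAME")
    parser.add_argument("filename")
    args = parser.parse_args(argv)

    overrides = {}
    for item in args.method:
        filetype, sep, name = item.partition(":")
        if not sep or not filetype or not name:
            # parser.error prints a usage message and exits with status 2.
            parser.error("expected TYPE:NAME, got %r" % item)
        overrides[filetype] = name
    return args.filename, overrides


filename, overrides = parse_method_overrides(
    ["--method", "pdf:tesseract", "--method", "audio:sphinx", "report.pdf"]
)
print(filename, overrides)
```

The resulting dictionary has the same shape as the JSON variant, without asking users to quote valid JSON in the shell.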
I like the configuration file option better than having to form valid JSON! The other two options, colons or hyphens, are both good, but I think the configuration file will likely be more future-proof.
Thanks for the input @barrust
INI format? YAML?
I think I have a small preference for YAML but I welcome arguments from others.
If I have time, I may try to mock this up on my flight back to Chicago on Friday. Sounds fun :)
I'm leaning toward INI format with this so people can set it in their project's setup.cfg. I also think we could probably address #96 at the same time, which would be very nice.
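As a sketch of the INI idea, the standard-library configparser can read a section out of a setup.cfg-style file. The [textract] section name and its keys are assumptions made for illustration; no such section is defined by textract in this thread.

```python
import configparser

# Assumed layout: a [textract] section mapping file types to methods.
SAMPLE_CFG = """
[textract]
pdf = tesseract
audio = sphinx
"""


def load_method_overrides(text, section="textract"):
    """Return the {filetype: method} pairs from an INI-formatted string.

    Hypothetical helper; the section name is an assumption.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(section):
        return {}
    return dict(parser.items(section))


print(load_method_overrides(SAMPLE_CFG))
```

To read a real file, parser.read("setup.cfg") could replace read_string; other sections in the file would simply be ignored.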
It's been a long time but this works for now if anyone's interested and checking this out:
from os.path import splitext
from textract import process
switcher = {
    "pdf": "pdfminer",
    "mp3": "SpeechRecognition"
}
filenames = ["hoho.txt", "asdf.pdf", "example.mp3"]
for filename in filenames:
    ext = splitext(filename)[1][1:]
    method = switcher.get(ext, "")
    text = process(filename, method=method)
When iterating over a large number of files, it is difficult to specify non-standard method kwargs for different filetypes. For example, currently method is used for PDFs and engine is used for audio files:
I personally like the simplicity of always having a method kwarg for the textract.process function, but what if we gave users the ability to test what extension a file has before it is processed, so they can easily handle PDFs vs. audio files, for example? I'm thinking of something like this:
Another approach would be to allow the method kwarg to also accept a dictionary:
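One way the dictionary idea could behave, sketched as a wrapper rather than as a change to textract itself. The function process_by_extension and the stand-in fake_process are hypothetical; the process callable is passed in so the example runs without textract installed, and in real use it would be textract.process. The method names are the ones mentioned in this thread.

```python
from os.path import splitext


def process_by_extension(filename, methods, process):
    """Call process(filename), adding method=... when the extension is mapped.

    Hypothetical wrapper; `process` would be textract.process in real use.
    """
    ext = splitext(filename)[1].lstrip(".").lower()
    kwargs = {"method": methods[ext]} if ext in methods else {}
    return process(filename, **kwargs)


# Stand-in for textract.process that only reports what it was called with.
def fake_process(filename, **kwargs):
    return (filename, kwargs)


methods = {"pdf": "tesseract", "wav": "sphinx"}
for name in ["notes.txt", "scan.pdf", "talk.wav"]:
    print(process_by_extension(name, methods, fake_process))
```

Files whose extension is not in the dictionary are passed through with no method kwarg, so each parser falls back to its own default.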