deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.91k stars 606 forks

Textract should allow directories as a supported file type #86

Open facundofarias opened 9 years ago

facundofarias commented 9 years ago

As far as I can see, a directory is not an option when choosing a file.

That would be a nice feature, something like:

  1. Indicate a directory with some extensions to match,
  2. Search recursively in the directory for those files,
  3. Finally, extract the content to a single file.

This would be very helpful when translating web sites, for example, and extracting all the strings to a resource file. We can discuss it, if you think it makes sense.

Thanks

deanmalmgren commented 9 years ago

Hmmm... Interesting idea. I really appreciate the suggestion.

At the moment I'm leaning toward not incorporating this into textract. Considering how easy it is to do something like this from the command line with something like:

#!/bin/bash
# -print0 and read -d '' keep file names containing spaces intact
find /path/to/some/directory -name '*.html' -print0 |
while IFS= read -r -d '' filename; do
    textract "$filename" >> output.txt
done

or to do this natively in Python with something like glob2, it seems a bit unnecessary to bake this into textract. The goal of this package is to streamline the interface for extracting the raw text from any document type, and I'd like to keep it as simple as possible while achieving that goal.
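
For reference, a minimal Python sketch of the same directory walk, using the standard library's `pathlib` rather than glob2. The helper name `extract_directory` and its parameters are hypothetical and not part of textract; the `extract` argument is assumed to be a callable such as `textract.process`, which takes a file path and returns the extracted text as bytes.

```python
from pathlib import Path
from typing import Callable, Union


def extract_directory(
    root: Union[str, Path],
    pattern: str,
    output_path: Union[str, Path],
    extract: Callable[[str], bytes],
) -> int:
    """Append the text of every file under `root` matching `pattern` to one file.

    `extract` is assumed to behave like textract.process: it takes a path
    string and returns bytes. Returns the number of files processed.
    """
    # Collect the matches first, in sorted order, so the output is deterministic.
    matches = sorted(p for p in Path(root).rglob(pattern) if p.is_file())
    count = 0
    # "ab" appends, mirroring the `>> output.txt` redirect in the bash loop.
    with open(output_path, "ab") as out:
        for path in matches:
            out.write(extract(str(path)))
            out.write(b"\n")
            count += 1
    return count
```

With textract installed, a call might look like `extract_directory("/path/to/some/directory", "*.html", "output.txt", textract.process)`. Keeping the extractor as a parameter leaves textract itself unchanged, which is in line with the reasoning above.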

I'll keep this issue open for a while in case others would like to comment on this concept, share other use cases where this would be helpful, or have other ideas for implementation.

ShawnMilo commented 9 years ago

In the spirit of the Unix philosophy, I agree with @deanmalmgren on this one. A program should do only one thing, and it is preferable to chain commands together rather than add non-essential features to commands.

MalikRumi commented 6 years ago

OK, I am going to ask a naive question here, and I hope you don't mind enlightening me. I tried your script as a Python for loop and was surprised to find I couldn't make it work. That is what led me here. It is one thing to say some sort of internal for loop is 'extra', I get that, but why doesn't it work in a regular Python for loop? That I don't get. Of course, it is entirely possible I just did it wrong. Nah, that can't be it. But your bash loop does work. The same goes for sending all the output to a single file instead of one output file per input file, but that part I was able to figure out. Thanks for sharing your insight, wisdom and experience with me!

deanmalmgren commented 6 years ago

@malikrumi can you provide an example? A Python for loop should work just fine...

MalikRumi commented 6 years ago

`
from os import listdir, environ
import textract
import django

environ['DJANGO_SETTINGS_MODULE'] = 'chronicle.settings'
django.setup()
from ktab.models import Entry

path = '/home/malikarumi/010417_odt_tests/'
filenames = listdir(path)

for filename in filenames:
    text = textract.process(filename, encoding='utf_8')
    text.write(Entry.objects.create(
        title=filename, content=text, chron_date='2018-01-05',
        clock='23:59:59', tag__tag='tagg'))
    text.save()
`

The code backticks seem not to be working for me.

(lifeandtimes) malikarumi@Tetuoan2:~/Projects/lifeandtimes/chronicle$ python django_textract_2.py
Traceback (most recent call last):
  File "django_textract_2.py", line 15, in <module>
    text = textract.process(filename, encoding='utf_8')
  File "/home/malikarumi/Projects/lifeandtimes/lib/python3.6/site-packages/textract/parsers/__init__.py", line 39, in process
    raise exceptions.MissingFileError(filename)
textract.exceptions.MissingFileError: The file "2018-01-01_psycopg2-error-at-or-near.odt" can not be found. Is this the right path/to/file/you/want/to/extract.odt?

Now, if the file can't be found, how does Python know the name of it? This script uses variables for file names in the expectation that it will iterate over all of them. I don't know what additional change I am supposed to make so that textract / Python can 'see' the file.

Note that the tag should be inserted directly into the Tag model, not into Entry.

Thanks.
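
The traceback is consistent with how `os.listdir` behaves: it returns bare file names, not full paths, so each name is resolved against the current working directory (here `~/Projects/lifeandtimes/chronicle`) rather than against `path`. That would explain why Python knows the name but cannot find the file. A minimal sketch of the usual fix, joining each name back onto the directory, follows; the helper names `list_full_paths` and `demo` are hypothetical, and the demo uses a temporary directory instead of the real one.

```python
import os
import tempfile


def list_full_paths(directory):
    """Return the full path of every regular file directly inside `directory`."""
    paths = []
    for name in sorted(os.listdir(directory)):
        # os.listdir yields bare names such as "example.odt"; join them back
        # onto the directory so they resolve from any working directory.
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path):
            paths.append(full_path)
    return paths


def demo():
    """Return the bare name from listdir and whether the joined path exists."""
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "listdir_demo_example.odt"), "w") as handle:
            handle.write("placeholder")
        bare_name = os.listdir(directory)[0]
        joined = list_full_paths(directory)[0]
        return bare_name, os.path.exists(joined)
```

In the script above, that would mean passing `os.path.join(path, filename)` to `textract.process` instead of `filename`.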