Open facundofarias opened 9 years ago
Hmmm... Interesting idea. I really appreciate the suggestion.
At the moment I'm leaning toward not incorporating this into textract. Considering how easy it is to do something like this from the command line with something like:
#!/bin/bash
for filename in $(find /path/to/some/directory -name '*.html'); do
    textract "$filename" >> output.txt
done
or to do this natively in python with something like glob2, it seems a bit unnecessary to bake this into textract. The goal of this package is to streamline the interface for extracting the raw text from any document type and I'd like to keep this as simple as possible while achieving this goal.
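For anyone who wants the Python route mentioned above, here is a minimal sketch using the standard-library `glob` module rather than glob2. The helper name `extract_directory` and its parameters are hypothetical and not part of textract. It takes the extraction function as an argument, so `textract.process` can be passed in.

```python
import glob
import os


def extract_directory(directory, pattern, extract):
    """Map each file under `directory` that matches `pattern` to its extracted text.

    `extract` is any callable that takes a file path, for example
    textract.process. This helper is hypothetical, not a textract API.
    """
    # "**" with recursive=True matches the directory itself and all subdirectories.
    search = os.path.join(directory, "**", pattern)
    results = {}
    for path in sorted(glob.glob(search, recursive=True)):
        if os.path.isfile(path):
            results[path] = extract(path)
    return results
```

A call would look something like `extract_directory('/path/to/some/directory', '*.html', textract.process)`. Keeping the loop outside the package is the same design choice as the bash version: textract handles one file, and the caller decides which files to feed it.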
I'll keep this issue open for a while in case others would like to comment on this concept, share other use cases where this would be helpful, or have other ideas for implementation.
In the spirit of the Unix philosophy, I agree with @deanmalmgren on this one. A program should do only one thing, and it is better to chain commands together than to add non-essential features to a command.
Ok, I am going to ask a naive question here, and I hope you don't mind enlightening me. I tried your script in a python for loop, and was surprised to find I couldn't make it work. That is what led me here. It is one thing to say some sort of internal for loop is 'extra', I get that, but why doesn't it work in a regular Python for loop? That I don't get. Of course, it is entirely possible I just did it wrong. Nah, that can't be it. But your bash loop does work. Same for all the output going to a single file, instead of one output file for each input file, but that part I was able to figure out. Thanks for sharing your insight, wisdom and experience with me!
@malikrumi can you provide an example? A Python for loop should work just fine...
`
from os import listdir, environ
import textract
import django
environ['DJANGO_SETTINGS_MODULE'] = 'chronicle.settings'
django.setup()
from ktab.models import Entry

path = '/home/malikarumi/010417_odt_tests/'
filenames = listdir(path)

for filename in filenames:
    text = textract.process(filename, encoding='utf_8')
    text.write(Entry.objects.create(
        title=filename, content=text, chron_date='2018-01-05',
        clock='23:59:59', tag__tag='tagg'))
    text.save()
`
The code backticks seem not to be working for me.
(lifeandtimes) malikarumi@Tetuoan2:~/Projects/lifeandtimes/chronicle$ python django_textract_2.py
Traceback (most recent call last):
  File "django_textract_2.py", line 15, in <module>
Now, if the file can't be found, how does python know the name of it? This script uses variables for file names in the expectation that it will iterate over all of them. I don't know what additional change I am supposed to make so that textract / Python can 'see' the file.
Note the tag insert should be changed to be directly into the Tag model, not into Entry.
Thanks.
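One possible explanation for the question above, offered as a sketch and not a confirmed diagnosis: `os.listdir` returns bare file names without the directory. Python therefore knows each name, but a relative name is resolved against the current working directory, not against `path`. Joining the two gives a path that can be opened from anywhere. The helper name `full_paths` is hypothetical.

```python
import os


def full_paths(directory):
    """Return the full path of every regular file directly inside `directory`."""
    paths = []
    for name in sorted(os.listdir(directory)):
        # listdir yields bare names only, so prepend the directory.
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            paths.append(candidate)
    return paths
```

Each returned path could then be handed to `textract.process` in place of the bare `filename`, assuming the rest of the loop stays the same.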
As far as I can see, a directory is not an option when choosing a file.
That would be a nice feature, something like:
This would be very helpful when, for example, translating websites and extracting all the strings to a resource file. We can discuss it if you think it makes sense.
Thanks