documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

Docsplit::TextExtractor#extract_text should return the path of the output text file? #139

Open nruth opened 8 years ago

nruth commented 8 years ago

Related to https://github.com/documentcloud/docsplit/issues/42

After extracting the text from a PDF or Doc file I need to do something with it. I understand not loading the string into Ruby (it could be huge), but it'd be helpful to get the output file path as a return value. Otherwise we have to use a different output dir per document or try to reconstruct the path from other information, which feels wrong.

Currently Docsplit::TextExtractor#extract_text returns the source file paths. For the transparent doc(x) file conversion it returns the intermediary tempfile PDF. E.g. when I map over an array with a PDF and a doc in my project's tmp dir I get back:

[
"/var/folders/_j/q3pr8b3s1vj85mhqvyb06gr40000gn/T/docsplit/sample.docx20160125-29577-go3upi.pdf",
"/Users/nruth/dev/monitor/tmp/AISB08.pdf20160125-29577-1svhpfo.pdf"
]

Instead I'd like to be given the path of the output text files, so I can open them.

Would this be a good PR, or is there a deliberate reason to return these other file paths that could be documented?

harssh commented 8 years ago

:+1: Are we going ahead with this, or is it already implemented?

nruth commented 8 years ago

I didn't make a PR. I worked around the problem by putting the document into its own temporary subdirectory and then using ls to find the output. I do think it's something that can be fixed, as it's just a forgot-to-think-about-the-return-value problem, but the PR backlog is growing.
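
A minimal sketch of that workaround, in case it helps anyone else. The helper name `with_extracted_text_paths` and the `extractor` callable are hypothetical, made up for illustration; only `Dir.mktmpdir` and `Dir.glob` are standard Ruby. It assumes the extractor writes one or more `.txt` files into the directory it is given.

```ruby
require "tmpdir"

# Hypothetical helper (not part of Docsplit): run an extractor against a
# private temporary directory, then list whatever .txt files it produced.
# `extractor` is any callable taking (source_path, output_dir).
def with_extracted_text_paths(source_path, extractor)
  Dir.mktmpdir("docsplit-text") do |dir|
    extractor.call(source_path, dir)
    # The directory holds output for this one document only, so every
    # .txt file found here belongs to it.
    yield Dir.glob(File.join(dir, "*.txt")).sort
  end
  # The temp dir is deleted when the block exits, so read or copy the
  # files inside the block.
end

# Assumed usage with Docsplit (requires the gem and its dependencies):
#   extractor = ->(src, dir) { Docsplit.extract_text(src, output: dir) }
#   with_extracted_text_paths("sample.docx", extractor) do |paths|
#     paths.each { |p| puts File.read(p) }
#   end
```

Passing the extractor in as a callable keeps the sketch independent of whatever extract_text happens to return, which is the whole point of the workaround.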