documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

Add layout option to keep layout during text extraction #132

Closed scarfacedeb closed 7 years ago

scarfacedeb commented 9 years ago

This adds an option to preserve the PDF's layout in the extracted text.

I noticed that pdftotext produces better results with the -layout option for some PDF files (e.g. tables of contents look a lot better and closer to the original markup).

At first, I implemented passing arbitrary options through to the pdftotext command, but later realised that I only need it for the -layout option.
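For readers unfamiliar with the flag, here is a minimal sketch of the idea, not the code from this pull request. The helper names `pdftotext_command` and `extract_text` and the `layout:` keyword are hypothetical and chosen only for illustration; the only things taken from pdftotext itself are the `-enc` and `-layout` flags.

```ruby
# Hypothetical helper (not docsplit's actual API): builds the argument
# list for a pdftotext call, adding -layout only when requested.
def pdftotext_command(pdf_path, txt_path, layout: false)
  cmd = ["pdftotext", "-enc", "UTF-8"]
  cmd << "-layout" if layout # ask pdftotext to keep the physical page layout
  cmd + [pdf_path, txt_path]
end

# Hypothetical wrapper: runs the command built above.
# Assumes the pdftotext binary (poppler or xpdf) is on the PATH.
def extract_text(pdf_path, txt_path, layout: false)
  system(*pdftotext_command(pdf_path, txt_path, layout: layout))
end
```

Under these assumptions, `extract_text("report.pdf", "report.txt", layout: true)` would write a layout-preserving text file, while leaving out `layout:` keeps the default reading-order output. Exposing a single boolean rather than arbitrary pass-through options keeps the surface small, which matches the reasoning above.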

scarfacedeb commented 8 years ago

@knowtheory any news on this one?

alexandremello commented 8 years ago

I need this feature too.

Is there anybody here?

scarfacedeb commented 7 years ago

Unfortunately, it seems like this project is dead.

knowtheory commented 7 years ago

Well, that's one way to get my attention (which I probably shouldn't encourage).

Thank you very much for the commit and the extraction here :)

scarfacedeb commented 7 years ago

@knowtheory That was a pleasant surprise 😄 Thank you!

knowtheory commented 7 years ago

You're welcome! Happy to talk about other issues over the short term, and things to tackle too :)

scarfacedeb commented 7 years ago

Do you have something in mind? 🤔

Btw, could you release a new version to RubyGems? It would be great to get rid of the ugly github: dependency in the Gemfile.