documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

How can I run multiple PDF extraction processes concurrently? #53

Closed quyen closed 11 years ago

quyen commented 12 years ago

I'd like to be able to extract PDFs concurrently, but this does not seem to be possible with the docsplit gem. I tried to extract 2 ppt files to PDF at the same time, and the gem fails to process them. The code is below; please replace path_to_docsplit.rb, path_to_test_file1.ppt, and path_to_test_file2.ppt with your own paths.

I'm looking forward to your answer. Thank you, Quyen

#!/usr/bin/ruby

require 'path_to_docsplit.rb'

def extraction(path_to_file)
  Docsplit.extract_pdf(path_to_file)
end

puts('start extraction')
t1 = Thread.new { extraction('path_to_test_file1.ppt') }
t2 = Thread.new { extraction('path_to_test_file2.ppt') }
t1.join
t2.join
puts('end extraction')

Natim commented 12 years ago

We are using a Redis queue with Circus to launch X workers, and it works fine.

http://redis.io/ http://circus.readthedocs.org/
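For readers who want to try the "several worker processes" idea without setting up Redis or Circus first, here is a minimal sketch in plain Ruby. It is not the setup described above, only an illustration of the same principle: each file is handled in its own forked process instead of a thread. The helper name `run_in_processes` and its `max_workers` parameter are made up for this example, and it assumes a platform where `fork` is available (so not Windows). Whether this avoids the failure reported in the original post has not been verified here.

```ruby
# Hypothetical helper (not part of Docsplit): run the given block once per
# path, each call in its own forked child process, with at most max_workers
# children alive at the same time.
# Returns a hash of path => true/false (whether the child exited successfully).
def run_in_processes(paths, max_workers: 2)
  results = {}
  running = {} # pid => path of the file that child is working on

  # Wait for any one child to finish and record its outcome.
  reap_one = lambda do
    pid, status = Process.wait2
    results[running.delete(pid)] = status.success?
  end

  paths.each do |path|
    # Keep the number of live children at or below max_workers.
    reap_one.call if running.size >= max_workers

    pid = fork do
      begin
        yield path
        exit!(0) # exit! skips at_exit handlers inherited from the parent
      rescue StandardError => e
        warn "#{path}: #{e.message}"
        exit!(1)
      end
    end
    running[pid] = path
  end

  # Collect the children that are still running.
  reap_one.call until running.empty?
  results
end

# Usage with Docsplit (commented out so the sketch runs without the gem):
#
#   require 'docsplit'
#   files = ['path_to_test_file1.ppt', 'path_to_test_file2.ppt']
#   results = run_in_processes(files) { |f| Docsplit.extract_pdf(f) }
#   results.each { |f, ok| puts "#{f}: #{ok ? 'done' : 'failed'}" }
```

A queue-based setup such as the one described above has the same shape, except that the workers are long-lived processes that pull file paths from a shared queue instead of being forked once per file.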

knowtheory commented 11 years ago

Hey @quyen can you be a little more specific about what errors you're encountering?

DocumentCloud uses Docsplit in a manner similar to what @Natim outlines.

avlakin commented 11 years ago

@knowtheory & @Natim - I'm trying to do the same thing as Quyen, but I'm having some trouble figuring out Circus.

Would you guys happen to know of any tutorial covering the set-up for using Circus to run multiple processes?

Thanks in advance.

Natim commented 11 years ago

@antonlakin Have a look at these projects: https://github.com/novagile/insight-reloaded and https://github.com/novagile/insight-installer. There are some configuration examples here: https://github.com/novagile/insight-installer/tree/master/chef/cookbooks/insight/templates/default

knowtheory commented 11 years ago

Just for some additional details, DocumentCloud uses CloudCrowd for distributed queuing of jobs which use Docsplit. You can check out the actions we've written, and in particular note the document_import action.