UTMediaCAT / Voyage

optimize warc creation with phantomjs/wpull #26

Closed kimpham54 closed 8 years ago

kimpham54 commented 9 years ago

Roger and Jai managed to get PhantomJS/WPull to produce the WARC, but it is slow. They believe they can make it faster by figuring out which elements to ignore. Here's what Roger writes: "We can manually force it to generate files in 2 mins for each URL, and the results can still be good. However, as a result, it is likely that a few images will be missing. (You can check the attachments to see a sample file generated by this strategy.)"
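
For anyone picking this up, here is a rough sketch of what a 2-minute cap per URL could look like. It is not Roger and Jai's actual script, which isn't shown in this thread. The helper names `run_with_time_limit` and `build_capture_command` are made up for illustration, and the `wpull` flags shown are an assumption that should be checked against the installed version.

```python
import subprocess
from typing import List, Sequence

# The "2 mins for each url" limit mentioned in the comment above.
TIME_LIMIT_SECONDS = 120


def run_with_time_limit(command: Sequence[str],
                        limit: float = TIME_LIMIT_SECONDS) -> bool:
    """Run a command; return True if it finished in time, False if cut off."""
    try:
        subprocess.run(command, timeout=limit, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so whatever was written so far is all we get for this URL.
        return False
    return True


def build_capture_command(url: str, warc_name: str) -> List[str]:
    """Hypothetical wpull invocation; flags are assumed, verify locally."""
    return ["wpull", url, "--warc-file", warc_name, "--phantomjs"]


if __name__ == "__main__":
    # Placeholder URL list, for illustration only.
    urls = ["http://example.com/"]
    for index, url in enumerate(urls):
        finished = run_with_time_limit(build_capture_command(url, f"capture-{index}"))
        print(url, "finished" if finished else "cut off at time limit")
```

A hard kill like this may leave the last records of the WARC incomplete, which could be one source of the missing images Roger mentions. Skipping slow elements up front, as proposed above, would avoid relying on the cutoff.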