DemocracyOS / bill-scraper

Bill scraper for feeding the DemocracyOS platform

Scrape .docs with bill projects #6

Closed: gvilarino closed this issue 10 years ago

gvilarino commented 11 years ago

Here you can find two .docs: one with the bill project as originally introduced (like the ones you can scrape from CEDOM), and a Despacho, which is the final version that ACTUALLY got debated by congressmen on the floor.

So we need to be able to turn the latter into HTML (https://crocodoc.com/ seems like a fine tool for that) and scrape it back into our platform.

gvilarino commented 11 years ago

So, after a lot of research and trial and error, we've concluded that Crocodoc isn't useful for us. Here's what we'll do instead:

  1. Use either the OpenOffice command-line tool or unoconv to convert Despachos from .doc/.docx into scrapable HTML files (I'd rather we used unoconv, since it supports both OpenOffice and LibreOffice formats and is a dedicated command-line tool).
  2. Scrape the resulting HTML Despacho with noodle into a DemocracyOS-compatible JSON structure (see the sketch right after this list).
  3. Persist the resulting JSON in mongo.
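A minimal sketch of what steps 2 and 3 could look like. The URL, selector, and document shape are assumptions for illustration, not a final schema; noodle queries URLs, so this assumes the converted HTML is served by some throwaway local static server:

```js
// Sketch of steps 2 and 3: scrape the converted Despacho HTML with noodle,
// then persist the resulting JSON into mongo.
var noodle = require('noodlejs');
var MongoClient = require('mongodb').MongoClient;

noodle.query({
  url: 'http://localhost:8000/despacho.html', // hypothetical local URL
  selector: 'p',                              // assumed selector for bill clauses
  extract: 'text'
}).then(function (response) {
  // Assumed DemocracyOS-compatible shape; adjust to the real schema.
  var bill = {
    source: 'despacho.doc',
    clauses: response[0].results
  };

  MongoClient.connect('mongodb://localhost:27017/democracyos', function (err, db) {
    if (err) throw err;                       // halt on errors, don't fail silently
    db.collection('bills').insert(bill, function (err) {
      if (err) throw err;
      db.close();
      console.log('Despacho persisted');
    });
  });
});
```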

There you go, @ultraklon

ultraklon commented 11 years ago

I finally made this work with LibreOffice, using the following command line:

`soffice --headless --convert-to htm:html --outdir ./ despacho.doc`

Nothing special. I just tried it again; it seems LibreOffice itself needs to be closed for this to work.
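For completeness, a small Node wrapper around that same command (just a sketch; it assumes `soffice` is on the PATH and, per the note above, that no other LibreOffice instance is running):

```js
// Drive the headless LibreOffice conversion from Node via child_process.
var exec = require('child_process').exec;

function docToHtml(docPath, outDir, callback) {
  var cmd = 'soffice --headless --convert-to htm:html --outdir ' + outDir + ' ' + docPath;
  exec(cmd, function (err, stdout, stderr) {
    if (err) return callback(err); // e.g. LibreOffice already open
    callback(null, stdout);
  });
}

// Usage with the file from this thread (output lands in ./despacho.htm):
docToHtml('despacho.doc', './', function (err) {
  if (err) throw err;
  console.log('converted');
});
```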

gvilarino commented 11 years ago

Will this work server-side?

ultraklon commented 11 years ago

I don't know what server we're using. What server are we using?

ultraklon commented 11 years ago

I mean, we need somewhere to host a copy of LibreOffice and a way to execute it with parameters and disk access.

gvilarino commented 11 years ago

No, we can't guarantee disk access on a server running DemocracyOS.

For the time being, convert it to HTML and scrape it locally. Even if we have to upload the resulting JSON by hand, I'd rather have that than no process at all. We could then add a script that does it all from a single command.

I guess we could upload the resulting HTML to an accessible URL (GDrive, Dropbox, whatever) and have the scraper run server-side. Still, following @jazzido's advice, it's better to have it run locally, halting on ALL errors and ensuring things get scraped the right way, than to rely too much on an automated solution that fails silently and messes up our data.
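As for what that single command could look like, here's a rough local pipeline sketch that halts on every error; `scrape.js` and `persist.js` are hypothetical script names, not existing code:

```js
// Hypothetical one-shot local pipeline: convert -> scrape -> persist.
// Halts loudly on any error instead of failing silently.
var exec = require('child_process').exec;

function run(cmd, next) {
  exec(cmd, function (err, stdout, stderr) {
    if (err) {
      console.error('FAILED:', cmd, '\n', stderr);
      process.exit(1); // halt on ALL errors, per @jazzido's advice
    }
    next(stdout);
  });
}

run('unoconv -f html despacho.doc', function () {
  run('node scrape.js despacho.html > despacho.json', function () {
    run('node persist.js despacho.json', function () {
      console.log('despacho scraped and persisted');
    });
  });
});
```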

gvilarino commented 11 years ago

Anyway :shipit:

ultraklon commented 11 years ago

Got it. I'm thinking about using the GDrive API to convert docs. I'll proceed with noodle, without worrying (yet) about how we receive the HTML docs.
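In case it helps, a rough, untested sketch of that GDrive conversion idea against the Drive v2 API: upload the .doc with `convert=true` so Drive stores it as a Google Doc, then read its HTML export link. The OAuth credentials are placeholders and the exact client calls may differ:

```js
// Rough sketch: convert a .doc to HTML via the Google Drive v2 API.
// Uploading with convert=true makes Drive store it as a Google Doc, which
// then exposes per-format download URLs in exportLinks.
var fs = require('fs');
var google = require('googleapis');

// Placeholder OAuth2 setup; a real credentials/token flow is required.
var auth = new google.auth.OAuth2('CLIENT_ID', 'CLIENT_SECRET', 'REDIRECT_URI');
auth.setCredentials({ access_token: 'ACCESS_TOKEN' });

var drive = google.drive({ version: 'v2', auth: auth });

drive.files.insert({
  convert: true, // ask Drive to convert on upload
  resource: { title: 'despacho' },
  media: {
    mimeType: 'application/msword',
    body: fs.createReadStream('despacho.doc')
  }
}, function (err, file) {
  if (err) throw err;
  // The converted doc's HTML export lives behind this (authorized) URL:
  console.log(file.exportLinks['text/html']);
});
```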

gvilarino commented 10 years ago

@ultraklon, @oscarguindzberg will be uploading newly obtained data files to @DemocracyOS's beta app; he'll be getting in touch with you for some assistance.

BTW: where's the code for converting .docs to HTML through Google's GDrive API?

gvilarino commented 10 years ago

I'm closing this as it's now followed by #10