OpenBudget / open-budget-frontend

Israeli budget web apps

adding 2 scrapers to the pipeline #380

Open VitalyLub opened 8 years ago

VitalyLub commented 8 years ago

scrapers.zip

akariv commented 8 years ago

The company scraper is integrated; the NGO scraper needs some work. Parsing the HTML the way it's done right now is too fragile: any style or layout change will break the parsing, and we won't have any indication that it happened. If possible, please replace the HTML text parsing with BeautifulSoup, PyQuery or similar. For some of the reasoning, please refer to this StackOverflow answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 @VitalyLub

VitalyLub commented 8 years ago

When will it be possible to run queries on the companies table? (Maybe there is a permissions issue?)

About the NGO scraper: the HTML on GuideStar really isn't a good fit for BeautifulSoup. Take a look at this page, for example: http://www.guidestar.org.il/he/organization/580624385 All of the data tags have the same class, "field-content"!

@akariv

akariv commented 8 years ago

In which case the selector would be something like

div.views-field-field-gov-registration-number > .field-content

And you would do (pseudo-code):

gov_registration_number =
    find("div.views-field-field-gov-registration-number > .field-content").text()

(which is also way more readable IMO)
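For reference, the pseudo-code above could look roughly like this with BeautifulSoup. This is only a sketch: it assumes the `bs4` package is installed, the HTML fragment is a made-up sample built from the class names mentioned in this thread (the real page markup may differ), and `extract_field` is a hypothetical helper name. It raises an error when the selector matches nothing, so a layout change shows up as a failure instead of silently producing bad data.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the class names mentioned in this thread;
# the real GuideStar markup may differ.
SAMPLE_HTML = """
<div class="views-field views-field-title">
  <span class="field-content">Example NGO</span>
</div>
<div class="views-field views-field-field-gov-registration-number">
  <span class="field-content">580624385</span>
</div>
"""


def extract_field(soup, field_class):
    """Return the text of the .field-content child of div.<field_class>."""
    node = soup.select_one("div.%s > .field-content" % field_class)
    if node is None:
        # Fail loudly so that a layout change is noticed right away.
        raise ValueError("selector matched nothing for field: %s" % field_class)
    return node.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
gov_registration_number = extract_field(
    soup, "views-field-field-gov-registration-number"
)
print(gov_registration_number)
```

Every value sits in a "field-content" element, but the wrapping div carries a field-specific class, so the parent class is what tells the fields apart.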

VitalyLub commented 8 years ago

better?

VitalyLub commented 8 years ago

associations_scraping.zip