Open VitalyLub opened 8 years ago
The company scraper is integrated; the NGO scraper needs some work. Parsing the HTML the way it's done right now is too fragile: any style or layout change will break the parsing, and we won't have any indication that it happened. If possible, please replace the HTML text parsing with BeautifulSoup, PyQuery, or similar. For the reasoning, please refer to this StackOverflow answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 @VitalyLub
When will it be possible to run queries on the company's table? (Maybe there is a permission issue?)
About the NGO scraper: the HTML on GuideStar really doesn't fit BeautifulSoup. Take a look at this page, for example: http://www.guidestar.org.il/he/organization/580624385 All the data tags have the class "field-content"!
@akariv
In which case the selector would be something like
div.views-field-field-gov-registration-number > .field-content
And you would do (pseudo-code):
gov_registration_number = find("div.views-field-field-gov-registration-number > .field-content").text()
(which is also way more readable IMO)
better?
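For reference, the pseudo-code above could look roughly like this with BeautifulSoup. This is only a sketch: the sample markup is made up to match the class names mentioned in this thread (the real GuideStar page may be structured differently), and `extract_field` is a hypothetical helper name, not something from the existing scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, assumed for illustration only.
# The real guidestar.org.il page structure may differ.
SAMPLE_HTML = """
<div class="views-field views-field-field-gov-registration-number">
  <span class="field-content">580624385</span>
</div>
<div class="views-field views-field-title">
  <span class="field-content">Example NGO</span>
</div>
"""


def extract_field(soup, field_class):
    """Return the text of the .field-content element that is a direct
    child of div.<field_class>, or None if no such element exists."""
    node = soup.select_one(f"div.{field_class} > .field-content")
    if node is None:
        # A missing field is reported explicitly instead of failing
        # silently, so a layout change can be detected by the caller.
        return None
    return node.get_text(strip=True)


# html.parser is the parser bundled with the Python standard library.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

gov_registration_number = extract_field(
    soup, "views-field-field-gov-registration-number"
)
missing = extract_field(soup, "views-field-field-does-not-exist")
```

Because every value shares the `field-content` class, the wrapping `div` class is what tells the fields apart. Returning `None` for a field that is not found also gives the scraper a way to notice when the page layout has changed.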
scrapers.zip