bnjy-opengov / chi-solicitations-feed

Wrapping the City of Chicago's publishing of solicitations in a machine-readable way.

Write Tests for Solicitations Search Page #4

Open nrrb opened 10 years ago

nrrb commented 10 years ago

Short of testing our app itself, which is still amorphous and exploratory, we should have tests for the Solicitations Search Page to make sure the page is in the form we expect and that it behaves predictably. That way we'll have a methodical way to figure out what's wrong if/when our script stops working.

First idea for a test is to make sure that the URL being used in the script, https://webapps1.cityofchicago.org/VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/begin.do?agencyId=city, actually results in this page: [screenshot: the solicitations search page]

Some other things we might want to test for related to the search form: that the search button exists in the form, that the form method is POST, that the form action is /VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/search.do, and so on (use your imagination): [screenshot: the search form]
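As a rough sketch of what the first two checks could look like with requests + lxml (the bare //form lookup is an assumption; it may need narrowing if the page has more than one form):

import requests
from lxml import html

URL = "https://webapps1.cityofchicago.org/VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/begin.do?agencyId=city"

# The URL should resolve and give us a parseable page.
response = requests.get(URL)
assert response.status_code == 200
root = html.fromstring(response.content)

# The search form should exist with the expected method and action.
form = root.xpath("//form")[0]
assert (form.get("method") or "").lower() == "post"
assert form.get("action") == "/VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/search.do"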

We might also want to test invalid values and make sure that we get error messages back from the interface, like giving a start date that occurs after the end date and making sure it complains about this: [screenshot: the date-range error message]
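That check probably needs a real browser, since it depends on the page's own validation. A sketch with Selenium, where the beginDate/endDate field names, the button name, and the error text are all guesses to be confirmed against the live page:

from selenium import webdriver

URL = "https://webapps1.cityofchicago.org/VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/begin.do?agencyId=city"

browser = webdriver.Firefox()
browser.get(URL)

# Deliberately give a start date after the end date (field names are guesses).
browser.find_element_by_name("beginDate").send_keys("12/31/2014")
browser.find_element_by_name("endDate").send_keys("01/01/2014")
browser.find_element_by_name("search button").click()

# The interface should complain somewhere in the rendered page.
assert "error" in browser.page_source.lower()
browser.quit()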

We would also want to make sure that the table of search results is the right size and has the right column headings: [screenshot: the search results table]
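For that one, something like the following sketch, where the column names are placeholders until we read them off a real results page (the results table only appears after a search has been submitted):

from lxml import html

def check_results_table(results_page_html):
    # results_page_html is the HTML of a page from a completed search.
    root = html.fromstring(results_page_html)
    headings = [th.text_content().strip() for th in root.xpath("//table//th")]
    # Placeholder column names; replace with the real ones from the results page.
    assert headings == ["Specification Number", "Description", "Bid Opening Date"]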

pebreo commented 10 years ago

OK, so I've written tests for the Solicitations page, and they are agonizingly slow with Selenium. I'm going to rewrite them using requests + lxml and use that approach for the other remaining tests. I think this is a better approach: since we are just checking that the DOM elements are there, requests will do the job. For the actual scraping, we can use Selenium.

I will write up in the wiki how I wrote the Solicitations page tests in Selenium and how I did it using requests + lxml.

nrrb commented 10 years ago

That's a good idea; tests running faster is always a good thing.

The only point I would add, and this is just something to wonder about rather than a question to answer immediately, is that the way lxml handles XPath may be different from the way Selenium handles it. It should be the same, but there are some instances where it isn't.

For instance, the browser usually adds tbody elements to each table element when it builds the DOM, even though tbody is rarely present in the page's original HTML.


On this page, the XPath to the Search button according to Firebug in Firefox is "/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr[14]/td/input". Trying to reach this via requests + lxml, I get nothing.


If I use that same XPath with Selenium and Firefox, it finds the search button correctly.

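To make the difference concrete, a quick sketch (the tbody-stripped path assumes the rest of the Firebug path is accurate):

import requests
from lxml import html

URL = "https://webapps1.cityofchicago.org/VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/begin.do?agencyId=city"
root = html.fromstring(requests.get(URL).content)

# The Firebug path includes tbody elements that exist only in the browser's
# DOM, not in the raw HTML that lxml sees, so this finds nothing:
firebug_xpath = "/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr[14]/td/input"
print(root.xpath(firebug_xpath))  # -> []

# Stripping the tbody steps should match the raw HTML, assuming the rest
# of the Firebug path is correct:
print(root.xpath(firebug_xpath.replace("/tbody", "")))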

pebreo commented 10 years ago

Hmm, good observation. It works fine for both Selenium and requests when I search through all nodes with a relative XPath like this:

# works - requests + lxml
root.xpath("//input[@name='search button']")

# works - selenium
browser.find_element_by_xpath("//input[@name='search button']")

So Selenium and lxml definitely handle absolute XPaths differently, as you have shown. That is something to look out for.

Also, check out the tests I have written so far:

https://github.com/pebreo/chi-solicitations-feed/blob/master/tests.py

Also, after attempting to write tests for form submissions, I realized that it's very hard to check for proper responses, because the creators of the website have obfuscated form submission. I came to this conclusion after trying to submit the form with requests and with Scrapy: there is too much JavaScript interfering with the submission. I suspect the POST expects randomly generated values from JavaScript to be submitted along with the query data. That means I will have to use a combination of Scrapy + Selenium to do the actual data scraping.
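Roughly what I have in mind, as a sketch: let Selenium drive the form so the page's JavaScript runs, then hand the rendered HTML to lxml (the beginDate/endDate field names are guesses to verify against the page; the button name is the one we found above):

from selenium import webdriver
from lxml import html

URL = "https://webapps1.cityofchicago.org/VCSearchWeb/org/cityofchicago/vcsearch/controller/solicitations/begin.do?agencyId=city"

browser = webdriver.Firefox()
browser.get(URL)

# Fill in and submit the form in a real browser so any JavaScript-added
# values go along with the POST (field names are guesses).
browser.find_element_by_name("beginDate").send_keys("01/01/2014")
browser.find_element_by_name("endDate").send_keys("12/31/2014")
browser.find_element_by_name("search button").click()

# Parse whatever the browser ended up with after the JS-driven submission.
results_root = html.fromstring(browser.page_source)
browser.quit()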