ResidentMario / urban-physiology-old

Urban Physiology project metarepository (old; see urban-physiology-toolkit for more up-to-date code).

Socrata portal pages do not load quickly enough under load for the table glossarizer to be effective #7

Open ResidentMario opened 7 years ago

ResidentMario commented 7 years ago

To get information on the number of rows and columns in a Socrata portal table more easily, early on I turned to scraping that information off of the portal UI.

The problem: after 20 or 50 or 100 or so table pages scraped this way, the scrape process starts to break down.

The reasons are twofold. First, I haven't succeeded in implementing a wait condition in selenium that waits exactly long enough for the row and column count elements to load. The tricky bit is that the counts live in elements with the metadata-pair class, but there are several such elements on the page (at least two, possibly more). A wait is necessary in the first place because the page is populated via AJAX, but I don't know how long to wait because I don't know which of the elements with this class to wait for, or how many of them are even on the page.
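One possible shape for such a wait condition is to stop waiting on any particular element and instead wait on the content: scan every metadata-pair element and succeed only once both counts have been found. The sketch below assumes, without having verified it against the portal markup, that the counts render as text like "Rows 1,234" and "Columns 12"; the function name and the regular expression are hypothetical.

```python
import re

# Assumption: each count is rendered inside a ".metadata-pair" element as a
# label followed by a number, e.g. "Rows 1,234". These labels are a guess.
COUNT_PATTERN = re.compile(r"(Rows|Columns)\D*(\d[\d,]*)", re.IGNORECASE)


def row_and_column_counts(driver):
    """Return {"rows": ..., "columns": ...} once both are present, else False."""
    found = {}
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    for element in driver.find_elements("css selector", ".metadata-pair"):
        match = COUNT_PATTERN.search(element.text)
        if match:
            found[match.group(1).lower()] = match.group(2)
    # Only report success when both counts have loaded.
    return found if {"rows", "columns"} <= found.keys() else False
```

Since selenium's WebDriverWait accepts any callable that takes the driver, this could be used as `WebDriverWait(driver, 10).until(row_and_column_counts)`, which polls until the function returns something truthy or raises TimeoutException. It sidesteps the question of how many metadata-pair elements exist, since it keys on what they contain.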

So I wrote a little time.sleep poller to do the wait, but that doesn't work very well either. The second problem, I think, is that Socrata is rate-limiting page requests. I even went as far as implementing a 10-second maximum wait time, but it wasn't enough.
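For reference, a time.sleep poller with a maximum wait might look roughly like the following. This is a minimal sketch, not the code in the repository; `poll_until`, `check`, and the default values are hypothetical names and settings chosen for illustration.

```python
import time


def poll_until(check, max_wait=10.0, interval=0.5):
    """Call check() until it returns a truthy value or max_wait seconds pass.

    Returns the truthy result, or None if the deadline is reached first.
    """
    deadline = time.monotonic() + max_wait
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None  # gave up: the page never produced what we wanted
        time.sleep(interval)
```

Here `check` would be whatever function reads the counts off the page and returns nothing until they are there. Returning None on timeout lets the caller skip the table and move on instead of crashing the whole scrape.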

I will need to look again at whether I can implement a better wait condition and, if not, perhaps look into some other approach. As it stands, getting the entire catalog loaded requires running this multiple times, which is not ideal.

ResidentMario commented 7 years ago

After some more experimentation, it looks like rate-limiting might not be the problem after all. It's really just transient network issues; CKAN seems to have the same thing going on.

The solution, insofar as there is one, is to run the script several times. Since the results are cached, subsequent runs will only touch jobs that were not previously processed, and hopefully get more of them done.
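The run-it-again approach can be sketched as follows. This is an illustration under assumptions, not the project's actual caching code: `run_with_cache`, the JSON cache file, and the `process` callback are all hypothetical, and jobs are assumed to be identified by strings (table URLs or IDs, say) with JSON-serializable results.

```python
import json
from pathlib import Path


def run_with_cache(jobs, process, cache_path):
    """Process each job not already in the cache; return the jobs that failed."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    failed = []
    for job in jobs:
        if job in cache:
            continue  # finished in an earlier run, do not touch it again
        try:
            cache[job] = process(job)
        except Exception:
            failed.append(job)  # transient failure: leave it for the next run
            continue
        path.write_text(json.dumps(cache))  # save progress after every success
    return failed
```

Calling this repeatedly with the same job list and cache path only retries whatever is still missing, so each run can only shrink the list of failures.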