TWJolly / fundraising_data_pull

Pulls data from the JustGiving API for a defined set of charities and highlights new pages

Script hangs up at "fundraising_page_data <-" #6

Closed daaronr closed 6 years ago

daaronr commented 6 years ago

Script hangs up at (console output below; it freezes there for several minutes):

> fundraising_page_data <-
+   map(fundraiser_search_data$Id, get_fundraising_data) %>%
+   reduce(bind_rows) %>%
+   left_join(fundraiser_search_data .... [TRUNCATED]

Ideas?

TWJolly commented 6 years ago

The API calls can take a while -- it takes me about 20-30 minutes to download all the data. Or is it not completing at all? Either way, it might be worth adding some prints so we know it's working.
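One way to add such prints is to wrap the existing call in `purrr::imap()` so each request reports its progress. This is only a sketch: `get_fundraising_data` and `fundraiser_search_data` are the repo's own objects, and the exact pipeline shape is taken from the console output above.

```r
library(purrr)
library(dplyr)

# Same pipeline as in the script, but each API call logs where it is.
# imap() passes the element and its position, so we can print a counter.
n_pages <- length(fundraiser_search_data$Id)

fundraising_page_data <-
  imap(fundraiser_search_data$Id, function(id, i) {
    message("Fetching page ", i, " of ", n_pages, " (Id: ", id, ")")
    get_fundraising_data(id)
  }) %>%
  reduce(bind_rows) %>%
  left_join(fundraiser_search_data)
```

`message()` writes to stderr immediately, so the progress lines appear even while the pipeline is still running.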

daaronr commented 6 years ago

You are right, it takes a long time. I'll keep it running.

TWJolly commented 6 years ago

Just pushed a commit that makes it print out what it's doing - re-pull if you think it might be getting stuck.

daaronr commented 6 years ago

Thanks. Will give it another go. It seems to have gotten stuck here, maybe because of my internet connection flaking out:

> donation_data <-
+   map(fundraising_page_data$pageShortName, get_fundraiser_donations) %>%
+   reduce(bind_rows) %>%
+   mutate(date_downloaded = S .... [TRUNCATED]
[2018-01-09 23:17:15] [info] asio listen error: system:48 (Address already in use)

daaronr commented 6 years ago

Yes, it works -- there were a few warnings we could look into:

5: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
8: In bind_rows_(x, .id) : Unequal factor levels: coercing to character

...but it seems to have worked brilliantly! Fantastic!

It took a long time, but I'm not sure how to check the timings (I'm new to R). When we share it publicly we should add some notes about how long it might take, and perhaps some tips on how to run it quicker, for those who just want a (random?) selection of pages fairly quickly.
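For checking timings, base R is enough: record `Sys.time()` before and after the pull (or wrap the whole run in `system.time()`). A minimal sketch, assuming the pull lives in a script -- the filename here is hypothetical:

```r
start <- Sys.time()

# Run the full data pull; "get_data.R" is a placeholder for the repo's script.
source("get_data.R")

elapsed <- Sys.time() - start
print(elapsed)  # prints a "Time difference of ..." message
```

`difftime` objects print in sensible units automatically, so this is usually all that's needed for a rough "how long did it take" note in the README.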

TWJolly commented 6 years ago

Ah great! And yes some timings would be good. It should be pretty easy to get it to print some estimates of how long it's going to take. Do you know roughly how long it took on your machine?

Regarding the warnings: they aren't an issue. The code produces a table with each API call, and sometimes it (usually incorrectly) decides that one or more of the columns in these small tables is categorical. The bind_rows function stacks these tables together and complains if the categories in each table are different. To deal with the mismatched categories it (correctly) converts the columns to character columns. I'll look at suppressing these warnings, or just documenting why they occur.
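The warning is easy to reproduce in isolation. A minimal example, unrelated to the repo's data (note that recent dplyr versions union the factor levels instead of warning, so this reflects the dplyr behaviour from around 2018):

```r
library(dplyr)

# Two small tables whose 'currency' column was parsed as a factor,
# each with a different set of levels.
a <- data.frame(currency = factor("GBP"))
b <- data.frame(currency = factor("USD"))

# Older dplyr warns "Unequal factor levels: coercing to character"
# and returns a character column.
combined <- bind_rows(a, b)

# One way to avoid the warning: convert factors to character up front
# (mutate_if was the idiomatic helper in dplyr at the time).
a2 <- mutate_if(a, is.factor, as.character)
b2 <- mutate_if(b, is.factor, as.character)
combined2 <- bind_rows(a2, b2)  # no warning; column is already character
```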

daaronr commented 6 years ago

If I am reading the file modified times correctly, it may have taken 33 minutes (home BT connection, run at midnight).