fossasia / open-event-scraper

Google spreadsheet parsing for Open Event JSON
MIT License

Automate scraping #13

Open joerx opened 8 years ago

joerx commented 8 years ago

Every time the schedule is updated, we need to re-run the parser. Doing this manually is tedious and during the event there won't be any time for this.

The schedule is in this document: https://docs.google.com/spreadsheets/d/1QeAyxbEc1fP9h5_kGQh-EVrx5RaYgbKFJBE9wUjIdvc/edit#gid=1847379562

The worksheet "Versioning" contains the list of tracks, and the GID column indicates which worksheet contains the sessions for that track. With every update, a value in the GID column changes and we need to re-run the scraper. The changed files then need to be pushed to GitHub.
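For illustration, the GID lookup could be parsed like this, assuming the Versioning sheet is exported as CSV with "Track" and "GID" columns (the actual header names may differ):

```python
import csv
import io

def track_gids(versioning_csv):
    """Map each track to the GID of the worksheet holding its sessions.

    The column names ("Track", "GID") are assumptions about the
    Versioning sheet's layout; adjust them to the real header row.
    """
    reader = csv.DictReader(io.StringIO(versioning_csv))
    return {row["Track"]: row["GID"] for row in reader if row.get("GID")}
```

Whenever a GID value changes here, the scraper knows which per-track worksheet to re-fetch.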

The likely approach is using event triggers in Apps Script (see https://developers.google.com/apps-script/guides/triggers/events). The scraper could be deployed to Heroku so we can invoke it via a URL.
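As a rough sketch of the receiving end, the trigger (Apps Script or similar) could POST to a small HTTP endpoint that kicks off the scraper. The `/rescrape` path and port below are made-up placeholders, not anything actually deployed:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class ScrapeHook(BaseHTTPRequestHandler):
    """Minimal webhook receiver; a real deployment would invoke
    scraper.py here instead of just acknowledging the request."""

    def do_POST(self):
        if self.path == "/rescrape":
            # Placeholder: this is where the pull/scrape/push steps
            # would be triggered.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

# To serve: HTTPServer(("", 8000), ScrapeHook).serve_forever()
```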

aviaryan commented 8 years ago

@joerx I would like to work on this issue. My plan of action on the Heroku server is as follows:

  1. For the first time, I will clone this repo on the server.
  2. Then, with each spreadsheet update, a script will run that will:
     - Pull changes from this repo into the server's copy.
     - Run the scraper (`scraper.py`).
     - Push the changed files back to the repo.
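The steps above could be sketched roughly as follows. The exact commands and the injectable `run` hook are assumptions for illustration, not the deployed script:

```python
import subprocess

def sync_and_publish(repo_dir, run=subprocess.run):
    """Pull the latest changes, re-run the scraper, and push the
    regenerated files. The command sequence mirrors the plan above;
    branch and commit-message details are guesses."""
    commands = [
        ["git", "pull", "--ff-only"],                # update server's copy
        ["python", "scraper.py"],                    # regenerate the JSON
        ["git", "add", "-A"],                        # stage changed files
        ["git", "commit", "-m", "Update schedule JSON"],
        ["git", "push", "origin", "master"],         # push back to the repo
    ]
    for cmd in commands:
        if run(cmd, cwd=repo_dir).returncode != 0:
            return False  # stop at the first failing step
    return True
```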

However, I have a concern.

joerx commented 8 years ago

@aviaryan - Great, thanks! A few questions:

aviaryan commented 8 years ago

> How are you planning to detect changes to the sheet? How will the job be triggered?

I haven't used Google Apps Script first-hand, but quick searches show that setting a trigger for a spreadsheet change and calling an external URL are both possible. If for some reason that doesn't work, I know for sure that IFTTT can set a trigger that fires on a spreadsheet change and calls any URL I want.

> Does your approach definitely require Heroku, or can we deploy it somewhere else if needed?

Any server should work fine. The main requirement is that it can run Python and Bash and has Git installed.

ashutoshsaboo commented 8 years ago

@joerx I tried to look into this. I have a suggestion, though I'm not sure it's the ideal one. How about this:

A wrapper script could run the scraper programmatically every n hours (a number we can decide) on days when no event is taking place, which the wrapper can detect by simply searching the schedule.

On event days, we can reduce n so that the scraper runs every hour or so.

We surely can't keep a script running 24x7 that constantly checks for changes in the Google Spreadsheet, because that would put a lot of load on the server and might even crash it. I may be wrong here, but that's my thinking.

Alternatively, if we want changes to be reflected in real time, we could first export the existing spreadsheet entries to a database, then offer a secure web utility for people to add events that saves entries directly to the database, so changes are reflected in real time by simply querying the database.

I am not sure which is the ideal option. I'd appreciate your views on the two alternatives so we can build on them and fix this issue collaboratively. @joerx

Thanks! :smile:

aviaryan commented 8 years ago

> We surely can't keep a script running 24x7 that constantly checks for changes in the Google Spreadsheet, because that would put a lot of load on the server and might even crash it. I may be wrong here, but that's my thinking.

The script doesn't poll the Google Spreadsheet for changes; the spreadsheet calls the script when it is changed, so there is no constant load.

@joerx Assuming the spreadsheet isn't changed over a span of 6 hours or so (at night, perhaps) and isn't changed very frequently (like every 30 minutes), the Heroku server should work just fine. I will test the server on a fork of this repo (on my account) first and let you know how it goes.

joerx commented 8 years ago

Have you investigated Apps Script? It seems similar to VBScript in MS Office, except it's JS-based. It should allow us to respond to changes as they happen without needing to schedule anything.

Having a cron job is not the best solution, since we'd need to host it somewhere (Heroku won't work with a free dyno).

https://developers.google.com/apps-script/

aviaryan commented 8 years ago

> Have you investigated Apps Script? It seems similar to VBScript in MS Office, except it's JS-based. It should allow us to respond to changes as they happen without needing to schedule anything.

I have used Apps Script a bit. Yes, it will let us set a trigger and do something, but the main thing is running the Python script and then Git; we will need a server for that. What do you suggest, @joerx? Can we get a paid Heroku server, or something else? By the way, I have a cheeky solution to keep a Heroku server running for a straight 24 hours on the free tier. I will try implementing and testing it on my repo.

ashutoshsaboo commented 8 years ago

@aviaryan So how do we detect changes in the Google Spreadsheet? That is, how will the latter call the scraping script when the spreadsheet is changed?

Sorry, I didn't understand, hence I thought I should ask.

@joerx Yes, that seems to be a good idea. We can try that out.

aviaryan commented 8 years ago

> @aviaryan So how do we detect changes in the Google Spreadsheet? That is, how will the latter call the scraping script when the spreadsheet is changed?

Google does it: when it sees that the spreadsheet has changed, it looks through all the hooks (URLs) attached to that event and calls them.

@joerx I am starting work on the Apps Script and the server code now. It shouldn't take more than a few hours.

joerx commented 8 years ago

Heroku: free dynos need to sleep 6 hours in every 24-hour period. That should do for just running the scraper; the instance can hibernate most of the time and only needs to wake up when we update the schedule.

aviaryan commented 8 years ago

@joerx Almost everything is done. Here is an automatically made commit: https://github.com/aviaryan/open-event-scraper/commit/27c7c405d309959ba112107fb5b2eb962a605cd3. I am sending a PR for it; please check the output before merging.

Some finishing touches are left after which the bot will be completely ready. Then we can use it for this repo. :smile:

joerx commented 8 years ago

Good work! Let's aim at getting it set up tomorrow.

aviaryan commented 8 years ago

The work is complete. Unfortunately I am not able to test it because of issue #15.

aviaryan commented 8 years ago

Hi. Where are you on this issue? The bot is working as expected. https://github.com/aviaryan/open-event-scraper/commits/master

motkeg commented 5 years ago

Hi, is this issue still relevant? If it is, please comment with the exact help needed.