ActoKids / AD440_W19_CloudPracticum

Finish Website crawler #72

Closed mrvirus9898 closed 5 years ago

mrvirus9898 commented 5 years ago

The website crawler has evolved in a lot of positive ways, including breaking up the scraper and crawler, and taming outdoorsforall. However, there is one big issue: the reliance on Eventbrite. The events coming from Eventbrite were good for debugging purposes, but they are not geared towards children. Therefore, Eventbrite needs to be removed and replaced with a new website. Shadow Seals is a good candidate, and there is already some code for it, but there is also https://footloosedisabledsailing.org/events/. It's up to you which target to pick.

Please indicate the time spent on this, any issues that you are having, any good references you found for this subject, and credit anyone who helped you out.

MikeJLeon commented 5 years ago

Due to the nature of Selenium, OFAScraper seems to be incompatible with Lambda: Selenium requires an environment designed for it (a browser binary plus a matching chromedriver) that Lambda does not provide. I don't currently know how to make this work; at the moment it seems impossible. I've spent roughly 8-12 hours on this issue alone, and I had expected it to take no more than 6 hours.

MikeJLeon commented 5 years ago

I've now spent about 20 hours. I've decided that we should instead consider using EC2 to run OFAScraper. I've not yet worked on the second crawler because this is a blocking issue for this one.

MikeJLeon commented 5 years ago

At toddy's suggestion, I decided to try layers in Lambda. That produced the same errors I was getting previously. I have since decided to use EC2 and have Lambda trigger the instance. This has proven to work, and I'm close to being done: Lambda triggers the EC2 instance, the instance runs and writes to Dynamo for both the OFA and Shadow Seals crawlers, and then it shuts down two minutes after the script finishes.

Time spent so far: 25 hours.

daonguyen81 commented 5 years ago

I have spent 20 hours on this task and I couldn't get chromedriver to run due to its permissions on Lambda. I have gone through many sources and tutorials but keep getting the same error when the function executes. I also tried to set up the layer from the following source, but I still couldn't get it to run: https://medium.com/the-apps-team/running-selenium-and-headless-chrome-on-aws-lambda-layers-python-3-6-bd810503c6c3
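For reference, the layer-based setup from that kind of tutorial looks roughly like the sketch below. The paths and Chrome flags follow the tutorial's general approach and are assumptions, not a configuration I actually got working on Lambda:

```python
# Rough sketch of the layer-based Selenium setup described in the tutorial above.
# It assumes a Lambda layer that unpacks headless-chromium and chromedriver into
# /opt; paths and flags are illustrative, not a working configuration for us.
from selenium import webdriver

def lambda_handler(event, context):
    options = webdriver.ChromeOptions()
    options.binary_location = "/opt/headless-chromium"  # browser binary from the layer
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--single-process")
    options.add_argument("--disable-dev-shm-usage")

    # This is where the permission error shows up for us: /opt/chromedriver
    # is not executable when loaded from the layer.
    driver = webdriver.Chrome(executable_path="/opt/chromedriver", options=options)
    driver.get("https://example.org/events")  # placeholder target URL
    title = driver.title
    driver.quit()
    return {"title": title}
```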

mrvirus9898 commented 5 years ago

It seems that layers may not solve our chromedriver issue, which is a shame. However, an EC2 instance will, so we should move in that direction. Perhaps it would be wise to migrate all the scripts to an EC2 instance and then have Lambdas and triggers that support it?

MikeJLeon commented 5 years ago

I have successfully gotten both SSScraper and OFAScraper working on EC2. We originally wanted these scrapers on Lambda, but Lambda proved to be a pain when it came to running Selenium. I wasted hours and hours trying to deal with the permissions errors (not caused by DevOps) related to chromedriver. In the end I started leaning towards just doing all of this on an EC2 instance. After I talked to Anar, he set me up with a micro instance to put my code on.

I started playing with it, and toddy suggested going back to Lambda and trying layers. I ended up trying layers, and it ran into the same issues as before. So I solidified my decision to just use EC2 for these crawlers, as it was proving to work.

I ended up creating an EC2 instance which is triggered by the Lambda function -

ad440-w19-lambda-crawler-launchec2

This will turn on the instance. The instance then has a start batch file that launches both scrapers. As the scrapers run, they write to log files (ofalog.log and sslog.log) and to the DynamoDB events table.
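The write pattern on the scraper side is roughly the sketch below; the item fields and helper name are illustrative placeholders, not the scrapers' actual schema:

```python
# Sketch of the pattern each scraper follows on the EC2 instance: log progress
# to its log file and write each scraped event to the DynamoDB events table.
# Field names below are placeholders, not the real schema.
import logging
import boto3

logging.basicConfig(filename="ofalog.log", level=logging.INFO)

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
events_table = dynamodb.Table("events")

def save_event(event_item):
    """Write one scraped event to DynamoDB and note it in the log."""
    events_table.put_item(Item=event_item)
    logging.info("Wrote event: %s", event_item.get("event_name", "<unknown>"))

# Example usage with placeholder fields:
# save_event({"event_id": "ofa-123", "event_name": "Adaptive Cycling", "source": "OFA"})
```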

After they finish running, the instance waits 2 minutes before shutting down.
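And for reference, the launch Lambda itself is essentially a single boto3 start_instances call. This is only a sketch; the instance ID is a placeholder, and the region matches the console links below:

```python
# Minimal sketch of an ad440-w19-lambda-crawler-launchec2-style handler.
# The instance ID is a placeholder, not the real value.
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder for the crawler EC2 instance
ec2 = boto3.client("ec2", region_name="us-east-1")

def lambda_handler(event, context):
    # Starting the stopped instance kicks off its start batch file,
    # which launches the OFA and Shadow Seals scrapers.
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    return {"started": INSTANCE_ID}
```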

Relevant issues:

  1. Issue @mrvirus9898 found with me on chromedriver - https://github.com/ActoKids/AD440_W19_CloudPracticum/issues/75
  2. Issue @mrvirus9898 found with me on chromedriver permissions, with some help from @T-travis - https://github.com/ActoKids/AD440_W19_CloudPracticum/issues/76
  3. I asked Anar for full access to Layers based on Toddy's request. Anar unfortunately could not figure out how to grant me the right permissions to apply my layers; I even tried using layers on my personal account to work around the chromedriver issues, but none of it worked - https://github.com/ActoKids/AD440_W19_CloudPracticum/issues/79
  4. I asked Anar for access to an EC2 server to implement the scrapers, which he granted. I also asked him for access keys for the EC2 to use with Dynamo - https://github.com/ActoKids/AD440_W19_CloudPracticum/issues/81
  5. I asked DevOps to allow Lambda to access EC2; Shota helped me on this one - https://github.com/ActoKids/AD440_W19_CloudPracticum/issues/84
  6. Dao found an error after OFA updated their website; I fixed it - https://github.com/ActoKids/AD440_W19_CloudPracticum/issues/89

Time estimated: 6 hours
Time spent: 35 hours

Pull request links:

  1. https://github.com/ActoKids/web-crawler/pull/17
  2. https://github.com/ActoKids/web-crawler/pull/19

Wiki link: https://github.com/ActoKids/web-crawler/wiki/AWS---EC2

Deployed code links (update 03/13: LAUNCHEC2 is not in use anymore. To test, please go straight to the EC2 instance and launch it manually. Dynamo may also fail soon, as I updated the table names):

  1. Lambda - https://console.aws.amazon.com/lambda/home?region=us-east-1#/functions/ad440-w19-lambda-crawler-launchec2?tab=graph
  2. EC2 - https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:sort=instanceId
  3. Dynamo - https://console.aws.amazon.com/dynamodb/home?region=us-east-1#tables:selected=events;tab=overview
  4. Cloudwatch - https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=crawlerlogs;streamFilter=typeLogStreamPrefix

Testers: @daonguyen81, @TyReed12

Tyler -

Everything worked as expected. Great work Mike!

Dao -

It worked great, Mike. The Lambda triggered the EC2 to run both crawler scripts, and the log output worked as expected. The EC2 stopped after running the scripts, as expected. I approve this PR.

However, Dao tested it again this morning and created #89.

What I tested: @rberry206's https://github.com/ActoKids/web-crawler/pull/18. I ran his Google crawler Lambda and it ran successfully. In the pull request I mention some errors that appeared in his code, but despite them it was still successful: it posted to Dynamo and everything worked smoothly.