ActoKids / AD440_W19_CloudPracticum


Website crawler: round one #41

Closed mrvirus9898 closed 5 years ago

mrvirus9898 commented 5 years ago

It is time to crawl through some real data. Three URLs are provided below, and more will be posted on request. Data from this crawler does not need to be mapped or formatted, only saved to DynamoDB as JSON objects. Don't be afraid to pollute DynamoDB right now; quality assurance will come later.

Please indicate the time spent on this, any issues you are having, any good references you found on this subject, and credit anyone who helped you out.

TARGET URLs:
http://www.shadowsealsswimming.org/Calendar.html
https://outdoorsforall.org/events-news/calendar/
https://footloosedisabledsailing.org/events/
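For anyone picking this up, here is a minimal sketch of the ask, assuming a Python crawler using requests/BeautifulSoup and boto3. The table name `ak-events`, the `event_id` key, and the `.event` CSS selector are placeholders, not the project's actual schema:

```python
import boto3
import requests
from bs4 import BeautifulSoup

TARGETS = [
    "http://www.shadowsealsswimming.org/Calendar.html",
    "https://outdoorsforall.org/events-news/calendar/",
    "https://footloosedisabledsailing.org/events/",
]

table = boto3.resource("dynamodb").Table("ak-events")  # placeholder table name

def crawl(url):
    """Fetch one calendar page and return raw, unformatted event records."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    events = []
    for i, node in enumerate(soup.select(".event")):  # selector differs per site
        events.append({
            "event_id": f"{url}#{i}",   # DynamoDB items need a partition key
            "source_url": url,
            "raw_html": str(node),      # unmapped data is fine for this round
        })
    return events

if __name__ == "__main__":
    for url in TARGETS:
        for item in crawl(url):
            table.put_item(Item=item)
```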

MikeJLeon commented 5 years ago

Estimated time: 20 hours

Outdoorsforall is already complete; starting work on the other two.

Working with Lyndon to figure out how to push to Dynamo.

MikeJLeon commented 5 years ago

I've managed to get S3, Lambda, and DynamoDB working together. The script can now write to a bucket, which triggers the Lambda to write to DynamoDB. The current issue is getting all of the data into DynamoDB: the Lambda times out after 3 seconds (the default timeout), so I will need to research this more or get help with it. I've also reworked the logic of my non-JavaScript crawler to make it more dynamic for other, similarly structured websites in the future. I decided to forgo scraping shadowseals and footloose in favor of getting my data onto DynamoDB; writing the logic for those two websites would not take long compared with making my code AWS-compatible.
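For reference, the pipeline above boils down to something like this sketch. The table name and the assumption that the crawler uploads a JSON array of events are placeholders; note that raising the timeout past the 3-second default is done in the Lambda function configuration, not in code:

```python
import json
import boto3

# Hypothetical table name; the real one lives in the project's AWS setup.
table = boto3.resource("dynamodb").Table("ak-events")
s3 = boto3.client("s3")

def handler(event, context):
    """Fires on s3:ObjectCreated and copies the uploaded JSON into DynamoDB."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        items = json.loads(body)  # assumes a JSON array of event objects
        # batch_writer groups put_item calls into BatchWriteItem requests,
        # cutting round trips so large uploads finish before the timeout
        with table.batch_writer() as batch:
            for item in items:
                batch.put_item(Item=item)
```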

At this point I just need to clean up my code before the pull request.

I've updated my wiki to reflect all current changes: https://github.com/ActoKids/web-crawler/wiki/Browser-Crawler

Actual time: 18 hours

MikeJLeon commented 5 years ago

Pull request: https://github.com/ActoKids/web-crawler/pull/9

toddysm commented 5 years ago

Good! A few more things for this sprint:

rberry206 commented 5 years ago

I tested this. The crawler did exactly what it was supposed to do. It has a running timer that is somewhat redundant, since no one other than a developer running the code will ever see it. Ideally this just runs on a server in the background doing its thing, so it should be as efficient as possible. The crawler itself was fantastic, and the data it pulled was abundant and clean.
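If the timer is worth keeping for development, one option is to log it at DEBUG level so it stays silent on a server run. A sketch, not the crawler's actual code; `crawl` is a stand-in for the existing entry point:

```python
import logging
import time

log = logging.getLogger("crawler")

def timed_crawl(url):
    start = time.perf_counter()
    events = crawl(url)  # stand-in for the existing crawler entry point
    # DEBUG output is hidden unless logging is explicitly configured to show
    # it, so normal server runs stay quiet
    log.debug("crawled %s in %.1fs", url, time.perf_counter() - start)
    return events
```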