Vlad777 / mit-stanford

MIT-Stanford MOOC mashup

mit-stanford

http://cs160.sjsu-cs.org/spring2013/sec2group3/

Extra feature request:

To support continued maintenance of the application, we must ensure that our web-scraping code can be verified, and fixed when it breaks, after the source websites are updated. Web-scraping code is by its nature hard to keep working indefinitely: sooner or later something on the source websites will change and break it. The following features will make it possible to react to those changes.

MIT
professor name (Charles/Takahiro)
professor image - URL to image (I don't think this is applicable for MIT? - Charles)
title - (Charles) course title
short description - is there such a thing as a short vs long description? (Meena??)
long description (Meena??)
course link - (Charles)
video link - (Vlad - there are many variations of how the courses organize their video links... I could use help on this as well.) link to first video, since there can be multiple. I suppose we will need a separate table to keep track of all the videos and other features we'll need. We'll need all the videos to calculate course_length. We'll need to be able to scrape all the video links ultimately.
start date - not applicable in our case, so set to '2001-01-01 01:01:01'
course_length - I suspect this is different from our feature of total video length, but I think we can use it anyway. However, I suspect this cannot be scraped easily and must instead be calculated by calling the YouTube API. Leave at 0 for now I guess, unless you want to start working on figuring out the API calls needed. Note that some of the "videos" are actually mp3 recordings, which we'll need to download and then use some library to determine their length from the files.
course_image - (Charles) just link to the image
category - (Charles) this may need to be normalized... for now just extract
site - 'MIT' (Diem)
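
For the course_length field, here is a minimal sketch of how the total could be computed once per-video durations are available. It assumes durations arrive as ISO 8601 strings such as `PT1H2M3S`, which is the form the YouTube Data API v3 reports in `contentDetails.duration`. The function names `parse_iso8601_duration` and `course_length_seconds` are hypothetical, and no API call is made here.

```python
import re

# Matches ISO 8601 durations limited to days, hours, minutes and seconds,
# e.g. "PT1H2M3S" or "P1DT30M". Years, months and weeks are not handled.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_iso8601_duration(value: str) -> int:
    """Return the duration in whole seconds; raise ValueError on bad input."""
    match = _DURATION_RE.match(value)
    if not match or value in ("P", "PT"):
        raise ValueError(f"unrecognized duration: {value!r}")
    # Components that are absent from the string count as zero.
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def course_length_seconds(durations):
    """Sum per-video durations; an empty list yields 0, the placeholder value."""
    return sum(parse_iso8601_duration(d) for d in durations)
```

For example, `course_length_seconds(["PT10M", "PT45S"])` returns 645. The mp3 recordings would need a separate path, since their length has to be read from the downloaded file.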

Stanford
professor name (Chris)
professor image - URL to image (Chris)
title - course title (Chris)
short description - is there such a thing as a short vs long description? (Alice)
long description (Alice)
course link - (Chris)
video link - (Alice) link to first video, since there can be multiple. I suppose we will need a separate table to keep track of all the videos and other features we'll need. We'll need all the videos to calculate course_length. We'll need to be able to scrape all the video links ultimately.
start date - not applicable in our case, so set to '2001-01-01 01:01:01' (Alice)
course_length - (Alice) I suspect this is different from our feature of total video length, but I think we can use it anyway. However, I suspect this cannot be scraped easily and must instead be calculated by calling the YouTube API. Leave at 0 for now I guess, unless you want to start working on figuring out the API calls needed. Note that some of the "videos" are actually mp3 recordings, which we'll need to download and then use some library to determine their length from the files.
course_image - just link to the image (Chris) - Perhaps use the video thumbnail if nothing else is available?
category - this may need to be normalized... for now just extract (Chris)
site - 'Stanford' (Diem)
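
To support the maintenance goal in the feature request, one option is a small check that runs after each scrape and flags records whose fields are missing or malformed, which is often the first sign that a source site has changed. The sketch below is only an illustration. The dictionary keys, the `validate_course` name, and the choice of which fields are required are assumptions based on the field lists in this document, not the project's actual schema.

```python
from urllib.parse import urlparse

# Hypothetical field names, derived from the lists in this README.
REQUIRED_TEXT_FIELDS = ("professor_name", "title", "long_description", "category")
REQUIRED_URL_FIELDS = ("course_link", "video_link", "course_image")
KNOWN_SITES = ("MIT", "Stanford")


def is_http_url(value) -> bool:
    """True if value is a string that parses as an http(s) URL with a host."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def validate_course(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks sane."""
    problems = []
    for field in REQUIRED_TEXT_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: missing or empty")
    for field in REQUIRED_URL_FIELDS:
        if not is_http_url(record.get(field)):
            problems.append(f"{field}: not an http(s) URL")
    if record.get("site") not in KNOWN_SITES:
        problems.append("site: expected one of " + ", ".join(KNOWN_SITES))
    return problems
```

A scrape run could call `validate_course` on every record and report the share of records with problems; a sudden jump in that share would suggest the scraper needs fixing.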