Closed by jianructose 3 years ago
(Can be adjusted after our discussion)
[x] Mon 4pm: By the end of the lab, team contract, setup repo, task assignment
[x] Mon 10pm (after lab): finish item 3. start 4, 5, 6.
[x] Thu 4pm: meeting time. finish item 3, 4, 5, 6
[x] Sunday 2pm: All tasks reviewed and merged
[x] Sunday 5pm: Milestone submitted on Canvas and released on Github
Let's leave it open until the submission. ✌
Milestone 1: Tackling big data on your laptop
Overall project goal and data
During this course, you will be working on a team project involving big data. The purpose is to get exposure to working with much larger datasets than you have previously in MDS. You have been assigned to teams of three or four. (See group assignment in Canvas.) Unlike previous project courses, in this course, all of you will be working on the same problem. In particular, you will be building and deploying ensemble machine learning models in the cloud to predict daily rainfall in Australia on a large dataset (~12 GB), where features are outputs of different climate models and the target is the actual rainfall observation.
You will be using this dataset on figshare. The dataset has been put together by Tom. See [this notebook](PUT THE NOTEBOOK LINK) if you're interested in understanding how the data was prepared for you.
At the end of the project, you should have your ML model deployed in the cloud for others to use.
During this course, you will work towards this goal step by step in four milestones.
Milestone 1 checklist
Part of the purpose of this milestone is to annoy you by making you work with large data in Pandas and vanilla CSV files. Typically these are not the best for dealing with large data. Along the way, you will also explore some useful tools for working with big data.
rubric={correctness:10}
[x] Similar to what you did in DSCI 522 and DSCI 524, create a team-work contract. The contract should outline how you are committed to work together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team.
[x] It is a fairly personal document and please do not push it into your public repositories. Instead, save it somewhere your team can easily share it, and you can share a link to it, or a copy with us in your submission to Canvas to prove you did this.
[x] Use this link to fill in your expectations by 10 PM on Mon: https://docs.google.com/document/d/1u8rVjqlNMzkhuL58qpSObv5E20KwVM-lyCbID_8y1Eo/edit
[ ] ASSIGNEE: @AishwaryaGopal12
[ ] 2. Creating repository and project structure
rubric={mechanics:10}
[x] 1. Similar to previous project courses, create a public repository under UBC-MDS org for your project.
[x] 2. Write a brief introduction of the project in the `README`.
[x] 3. Create a folder called `notebooks` in the repository and create a notebook for this milestone in that folder.
[ ] ASSIGNEE: @jianructose
[x] 3. Downloading the data
rubric={correctness:10}
[x] 1. Download the data from figshare to your local computer using the figshare API (you can make use of the `requests` library).
[x] 2. Extract the zip file, again programmatically, similar to how we did it in class.
[x] ASSIGNEES: @shoebillm
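The download and extraction steps above could be sketched roughly as follows. This is a minimal sketch, not the course's reference solution: it assumes the dataset is a public figshare article whose metadata is served at `https://api.figshare.com/v2/articles/<article id>` with a `files` list containing `name` and `download_url` fields. The helper names and the article ID, file name, and paths in the usage comments are hypothetical placeholders.

```python
import os
import zipfile

import requests


def list_figshare_files(article_id, base_url="https://api.figshare.com/v2"):
    """Return the file records of a public figshare article (assumed layout)."""
    response = requests.get(f"{base_url}/articles/{article_id}", timeout=60)
    response.raise_for_status()
    return response.json()["files"]


def find_file(files, name):
    """Pick the record whose 'name' matches, or raise if it is not attached."""
    for record in files:
        if record["name"] == name:
            return record
    raise FileNotFoundError(f"{name!r} is not attached to this article")


def download_file(url, dest_path, chunk_size=1 << 20):
    """Stream the download to disk in 1 MiB chunks instead of holding it in memory."""
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
    return dest_path


def extract_zip(zip_path, out_dir):
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Hypothetical usage (placeholders, fill in your own values):
# files = list_figshare_files("<article id>")
# record = find_file(files, "<zip file name>")
# zip_path = download_file(record["download_url"], "data/<zip file name>")
# extract_zip(zip_path, "data/raw")
```

Streaming with `iter_content` keeps memory flat for a multi-gigabyte archive; a plain `response.content` would load the whole file at once.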
rubric={correctness:10,reasoning:10}
[x] 1. Use one of the following options to combine data CSVs into a single CSV.
[x] 2. When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).
[x] 3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.
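One possible way to combine the files with Pandas while adding the "model" column is sketched below. It is only an illustration under stated assumptions: every input CSV shares the same columns, file names end in `_daily_rainfall_NSW.csv` as in the tip above, and the output file is written outside the input glob pattern. The function names are hypothetical, and only run time is measured here, not memory.

```python
import glob
import os
import time

import pandas as pd


def model_name(csv_path, suffix="_daily_rainfall_NSW.csv"):
    """Derive the model name from the file name by stripping the shared suffix."""
    base = os.path.basename(csv_path)
    if base.endswith(suffix):
        return base[: -len(suffix)]
    # Fallback for unexpected names: drop only the extension.
    return os.path.splitext(base)[0]


def combine_csvs(csv_paths, out_path):
    """Append each CSV to out_path in turn, so only one file is in memory at a time.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for i, path in enumerate(sorted(csv_paths)):
        df = pd.read_csv(path)
        df["model"] = model_name(path)
        # Write the header once, then append the remaining files without it.
        df.to_csv(out_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)
    return time.perf_counter() - start


# Hypothetical usage (paths are placeholders):
# elapsed = combine_csvs(glob.glob("data/raw/*_daily_rainfall_NSW.csv"), "data/combined.csv")
# print(f"Combined in {elapsed:.1f} s")
```

Appending file by file trades a little speed for a much lower peak memory than `pd.concat` over all frames, which is one of the differences worth recording when comparing machines.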
rubric={correctness:10,reasoning:10}
[ ] 1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
`dtype` of your data
[ ] 2. Discuss your observations.
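As an illustration of the `dtype` approach, the sketch below downcasts `float64` columns to `float32` and stores a repetitive text column as `category`, then compares memory before and after. The helper names and the assumption that the text column is called "model" are hypothetical; note that `float32` loses precision, so whether that is acceptable depends on your analysis.

```python
import numpy as np
import pandas as pd


def shrink_dtypes(df, category_cols=("model",)):
    """Return a copy with float64 columns as float32 and chosen columns as category."""
    float_cols = df.select_dtypes(include="float64").columns
    out = df.astype({col: "float32" for col in float_cols})
    for col in category_cols:
        if col in out.columns:
            out[col] = out[col].astype("category")
    return out


def memory_mb(df):
    """Total memory of the frame in megabytes, counting string contents."""
    return df.memory_usage(deep=True).sum() / 1e6


# Small synthetic example standing in for the real data.
example = pd.DataFrame(
    {"rain": np.linspace(0, 1, 1000), "model": ["A", "B"] * 500}
)
smaller = shrink_dtypes(example)
print(f"before: {memory_mb(example):.3f} MB, after: {memory_mb(smaller):.3f} MB")
print(smaller["model"].value_counts())
```

The same conversion can be requested at load time through the `dtype` argument of `pd.read_csv`, which avoids ever materializing the wider columns.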
[ ] 6. Perform a simple EDA in R
[ ] ASSIGNEES: @adibns
rubric={correctness:15,reasoning:10}
[ ] 1. Pick an approach to transfer the dataframe from python to R (.ipynb).
[ ] 2. Discuss why you chose this approach over others.
Specific expectations for this milestone
[x] In this milestone, we are looking for a well-documented and self-explanatory notebook exploring different options to tackle big data on your laptop.
[x] Discuss any challenges or difficulties you faced when dealing with this large data on your laptops. Briefly explain your approach to overcoming the challenges, or the reasons why you were not able to overcome them.
[x] ASSIGNEES: @jianructose
Submission instructions
[ ] Aish will submit this week
rubric={mechanics:5}
In the textbox provided on Canvas for the Milestone 1 assignment include:
[ ] The URL of your public project's repository
[ ] The URL of your notebook for this milestone