Course Evaluation
Final Project Reports (Due Dec 9)
Similar to progress reports with additional sections:
- Objective (research question)
- Data that was used: how obtained, how processed, integrated, and validated
- What models or algorithms were used
- Results: A description of the results
- Primary issues encountered during the project
- Future work: ideas generated, improvements that would make sense, etc.
- Org chart: rough timeline and responsibilities for each member
Class on Dec 2
- Final Project Presentations (MP3 Part D due)
- Comic Scraper
- DiscordTextAnalyzer Second Choice
- CinemaScores
- Generational Debt
- OSTI-Publication-Analysis
- StackBot
Class on Nov 25
- Final Project Presentations
- InterPlanetarySpider
- SteamSuggestion
- Library of Babel
- FantasyFootball Analysis
- TwitterGang
- MacroManage
Class on Nov 22
- Final Project Presentations
- ImageClustering
- CSM
Class on Nov 20
- Data Sharing (constructing shared data set produced in this class)
- Work on Final projects
- Help with MP3
Class on Nov 18
- Work on Final projects
- Help with MP3
Class on Nov 15
- Work on Final projects
- MP3 Part D discussion
Class on Nov 13
- Work on Final projects
- MP3 Part D discussion
Class on Nov 11
- Work on Final projects
- Instructions for Part D posted on MP3 page, video of class on
Nov 8
Class on Nov 8
- Introducing MP3 part 4 (final data analysis)
- Google's Cloud Hero is coming to UT Knoxville! During this 3-hour session, you’ll hear briefly from the Google Cloud team as to why cloud solutions are integral to your career, get access to further learning and career opportunities for a cloud-first world, and play an Infrastructure & Data game to be Cloud Hero UT Knoxville! There will be a competition with Google swag prizes and you will be given codes for free content on Coursera so you can continue your training for Google Cloud and make progress toward the highly marketable Google Cloud certification.
- When: Friday, November 8, 2019 from 1:00 p.m. - 4:00 p.m.
- Where: Haslam College of Business, Room 501
- Register: https://events.withgoogle.com/cloud-hero-utk/
Class on Nov 6
- Work on Final projects
- 21 still incomplete/missing for MP3 part C
Class on Nov 4
- Data Analysis Lecture
- Please finish MP3 Part C: only 14 are done as of this morning. Please note that connectB.sh had typos, so use the latest one
Class on Nov 1
- Work on Final projects
- Finish MP3 part C
Class on Oct 30
- Schedule the presentation for the course project: please sign up by adding a comment on this issue (include your team name and the dates that would work for your team)
- Progress reports
- Script for MP3 Part C complete
Class on Oct 21, 23, 25
- Oct 23: Miniproject3 Part B due
- Will meet with teams to hear progress report: a report similar to the proposal, but indicating if anything changed
- Objective (research question)
- Data to be used: how obtained, how processed, integrated, and validated
- What models or algorithms will be used
- What will be done with rough timeline and responsibilities for each member
- A description of the partial results
- Problems encountered so far
Class on Oct 16
- Will help with Miniproject3 part B
Class on Oct 14
- Miniproject3 Part A due (this time really)
- Please note changes to MP3 Part B: the laptop should forward from port 3000!
- Chrome seems to work for Part B, but Safari and possibly Firefox may not work (you have to be able to see the annotation once you save it)
Class on Oct 11
- Work on Final Project/Miniproject3 part B
Class on Oct 9
- Discuss Miniproject Part B
Class on Oct 7
- Cliff notes on text analysis
- Introduce MP3 part B
Class on Oct 2-4
- Work on final projects
- Ensure GCP works
Class on Sep 29
Class on Sep 27
Class on Sep 25 (complete project proposals)
Class on Sep 23 (complete Miniproject2)
- Miniproject2 is due at the end of the class
Class on Sep 20
Class on Sep 18
- The remaining teams are formed: Everyone has a final project at the end of the class
- Start brainstorming/writing final project proposal (see Sep 25)
Class on Sep 16
- Remaining final project pitches are due
- Most teams formed (create fdac19/ProjectName repo and a team of the same name; invite members of the team)
Class on Sep 13
- Present the rest of the selected 10 miniproject1's to the class
- Pitches for the final project
- Introducing Data Discovery - Miniproject2
Class on Sep 11
- Present the selected 10 miniproject1's to the class
- Explain pitches for the final project
- How to resolve common problems
- Symptom: nothing appears in the browser for localhost:8888
- Solution: run /bin/notebook.sh in the docker container
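Before restarting anything, it can help to confirm that nothing is listening on the port at all. The following is a minimal sketch, not part of the course scripts: the helper name `port_is_open` is an assumption made for illustration, and it only tests whether a TCP connection to `localhost:8888` succeeds. If the check fails, either the notebook server is not running (the solution above) or the port is not being forwarded to your laptop.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8888 is the notebook port named in the symptom above.
    if port_is_open("localhost", 8888):
        print("Something is listening on localhost:8888")
    else:
        print("Nothing on localhost:8888; the notebook server may not be running")
```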
Class on Sep 09
- Present your miniproject1 in small groups
Class on Sep 06
- Discuss ideas with your assigned peers, work on the miniproject1
Class on Sep 04
Class on Aug 30: Attend only if you need help with Practice0 face to face
- Attend only if you need help with the Practice0 task. It involves a number of steps; if you get stuck on any of them, please do one of the following:
- Open an issue
- Ask the TAs for help before the class
- Come to class, where a TA will be there to help
- Participate virtually by joining a Zoom session (connection details are on the news page); I'll be there to answer your questions
Class on Aug 28
- Lecture explaining key technologies used in the class
Class on Aug 26
- Please submit the pull request (TAs will be in the class to help)
- TAs will help you set up ssh/putty so that you can access
jupyter notebooks
- Make sure ssh/putty setup works
- Full details
Class on Aug 23
- Make sure you accept your github invitations
- Follow through ssh/putty setup
Class on Aug 21
- Create your github account
- fork repo students
- create your utid.md file providing your name and interests:
see Audris.md for inspiration, and also provide your
utid.key with your public ssh key.
- submit a pull request to fdac19/students
- Make sure you do it during the class so we are ready to start on Aug 23
Information for remote participation via Zoom
Syllabus for "Fundamentals of Digital Archeology"
- Course: COSC-445/COSC-545
- MK-524 10:10-11:00 MWF
- Instructor: Audris Mockus, audris@utk.edu, office hours in MK613 on request
- TA: Preston Provins, pprovins@vols.utk.edu, office hours available upon request
- TA: David Kennard dkennard@vols.utk.edu
- office hours MinKao 217, Wednesday: 2:30PM - 4:30PM, Thursday: 1:00PM - 3:00PM, Friday: 2:30PM - 4:30PM
- Syllabus
- Need help?
Simple rules:
- There are no stupid questions. However, it may be worth going over the following steps:
- Think of what the right answer may be.
- Search online: stack overflow, etc.
- Look through issues
- Post the question as an issue.
- Ask instructor: email for 1-on-1 help, or
to set up a time to meet
Objectives
The course will combine the theoretical underpinnings of big data with
intense practice. In particular, approaches to ethical concerns,
reproducibility of the results, absence of context, missing data,
and incorrect data will be both discussed and practiced by writing
programs to discover the data in the cloud, to retrieve it by
scraping the deep web, and by structuring, storing, and sampling it
in a way suitable for subsequent decision making. At the end of the
course students will be able to discover, collect, and
clean digital traces, to use such traces to construct meaningful
measures, and to create tools that help with decision making.
Expected Outcomes
Upon completion, students will be able to discover, gather, and analyze
digital traces, will learn how to avoid mistakes common in
the analysis of low-quality data, and will have produced a working
analytics application.
In particular, in addition to practicing critical thinking,
students will acquire the following skills:
- Use Python and other tools to discover, retrieve, and process data.
- Use data management techniques to store data locally and in the cloud.
- Use data analysis methods to explore data and to make predictions.
Course Description
A great volume of complex data is generated as a result of human
activities, including both work and play. To exploit that data for
decision making it is necessary to create software that discovers,
collects, and integrates the data.
Digital archeology relies on traces that are left over in the course
of ordinary activities, for example the logs generated by sensors in
mobile phones, the commits in version control systems, or the email
sent and the documents edited by a knowledge worker. Understanding
such traces is more complicated than understanding data collected using
traditional measurement approaches.
Traditional approaches rely on a highly controlled and well-designed
measurement system. In meteorology, for example, the temperature is
taken in specially designed and carefully selected locations to
avoid direct sunlight and to be at a fixed distance from the ground.
Such measurement can then be trusted to represent these controlled
conditions and the analysis of such data is, consequently, fairly
straightforward.
The measurements from geolocation or other sensors in mobile phones
are affected by numerous (yet not recorded) factors: was the phone
kept in the pocket, was it indoors or outside? The devices are not
calibrated or may not work properly, so the corresponding
measurements would be inaccurate. Locations (without mobile phones)
may not have any measurement, yet may be of the greatest interest.
This lack of context and inaccurate or missing data necessitates
fundamentally new approaches that rely on patterns of behavior to
correct the data, to fill in missing observations, and to elucidate
unrecorded context factors. These steps are needed to obtain
meaningful results from a subsequent analysis.
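As a toy illustration of using a pattern of behavior to fill in missing observations, the sketch below replaces each missing reading with the median of the readings observed at the same hour of day. The function name `fill_missing_by_hour`, the data layout, and the choice of the median are all assumptions made for this example; they are not a method prescribed by the course.

```python
from collections import defaultdict
from statistics import median


def fill_missing_by_hour(readings):
    """Fill None values using the median of observed values for the same hour.

    readings: list of (hour_of_day, value_or_None) pairs.
    A missing value stays None if that hour has no observed values at all.
    """
    observed = defaultdict(list)
    for hour, value in readings:
        if value is not None:
            observed[hour].append(value)

    filled = []
    for hour, value in readings:
        if value is None and observed[hour]:
            # Hypothetical rule: assume behavior repeats at the same hour.
            value = median(observed[hour])
        filled.append((hour, value))
    return filled


if __name__ == "__main__":
    sample = [(8, 10.0), (8, None), (8, 14.0), (9, None), (9, 3.0), (10, None)]
    for hour, value in fill_missing_by_hour(sample):
        print(hour, value)
```

Note that the hour with no observations is left missing rather than guessed, which keeps the gap visible to later analysis.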
The course will cover basic principles and effective practices to
increase the integrity of the results obtained from voluminous but
highly unreliable sources.
- Ethics: legal aspects, privacy, confidentiality, governance
- Reproducibility: version control, ipython notebook
- Fundamentals of big data analysis: extreme distributions, transformations, quantiles, sampling strategies, and logistic regression
- The nature of digital traces: lack of context, missing values, and incorrect data
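To see why quantiles and transformations appear in the list above, consider a small made-up sample with one extreme value. This is only a sketch using Python's standard library; the numbers and the helper names `summarize` and `log_transform` are invented for illustration.

```python
import math
from statistics import mean, median, quantiles


def summarize(values):
    """Return the mean, median, and quartiles of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return {"mean": mean(values), "median": median(values), "q1": q1, "q3": q3}


def log_transform(values):
    """Apply log10 to strictly positive values to compress a long tail."""
    return [math.log10(v) for v in values]


if __name__ == "__main__":
    # Made-up counts, e.g. commits per developer; one value is extreme.
    counts = [1, 1, 2, 2, 3, 3, 4, 5, 8, 1000]
    print("raw:", summarize(counts))
    print("log10:", summarize(log_transform(counts)))
```

On the raw values the mean is dominated by the single extreme observation while the median is not; after the log transformation the two summaries are much closer together.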
Prerequisites
Students are expected to have basic programming skills, in
particular, be able to use regular expressions, programming concepts
such as variables, functions, loops, and data structures like lists
and dictionaries (for example, COSC 365).
Being familiar with version control systems (e.g., COSC 340), Python
(e.g., COSC 370), and introductory-level probability (e.g., ECE 313)
and statistics, such as random variables, distributions, and
regression, would be beneficial but is not expected. Everyone is
expected, however, to be willing and highly motivated to catch up in
the areas where they have gaps in the relevant skills.
All the assignments and projects for this class will use github and
Python. Knowledge of Python is not a prerequisite for this course,
provided you are comfortable learning on your own as needed. While
we have strived to make the programming component of this course
straightforward, we will not devote much time to teaching
programming, Python syntax, or any of the libraries and APIs. You
should feel comfortable with:
- How to look up Python syntax on Google and StackOverflow.
- Basic programming concepts like functions, loops, arrays, dictionaries, strings, and if statements.
- How to learn new libraries by reading documentation and reusing examples.
- Asking questions on StackOverflow or as a GitHub issue.
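As a rough self-check of the expected level, the sketch below combines a regular expression, a function, loops, and a dictionary. The task (counting hashtags) and the name `count_hashtags` are made up for illustration and are not taken from any assignment.

```python
import re


def count_hashtags(lines):
    """Count case-insensitive hashtag occurrences across a list of strings."""
    counts = {}
    for line in lines:
        # \w+ matches the letters, digits, and underscores after '#'.
        for tag in re.findall(r"#(\w+)", line):
            tag = tag.lower()
            counts[tag] = counts.get(tag, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_hashtags(["Loving #Python and #data", "more #python"]))
```

If every line of this reads naturally to you, the programming side of the course should be manageable.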
Requirements
These apply to real life, as well.
- Must apply "good programming style" learned in class
- Bonus points for:
- Creativity (as long as requirements are fulfilled)
Teaming Tips
- Agree on an editor and environment that you're comfortable with
- The person who's less experienced/comfortable should have more keyboard time
- Switch who's "driving" regularly
- Make sure to save the code and send it to others on the team
Evaluation
- Class Participation - 15%: students are expected to read all
material covered in a week and come to class prepared to take
part in the classroom discussions. Responding to other students'
questions (issues) counts as classroom participation.
- Assignments - 40%: Each assignment will involve writing (or modifying a template of)
a small Python program.
- Project - 45%: one original project done alone or in a group of 2 or 3
students. The project will explore one or more of the themes covered
in the course that students find particularly compelling. The
group needs to submit a project proposal (2 pages IEEE format)
approximately 1.5 months before the end of term. The proposal
should provide a brief motivation of the project, a detailed
discussion of the data that will be obtained or used in the project,
a timeline of milestones, and the expected outcome.
Other considerations
As a programmer you will never write anything from scratch, but will
reuse code, frameworks, or ideas. You are encouraged to
learn from the work of your peers. However, if you don't try to do
it yourself, you will not learn. Deliberate practice
(activities designed for the sole purpose of effectively improving
specific aspects of an individual's performance) is the only way to
reach perfection.
Please respect the terms of use and/or license of any code you find,
and if you re-implement or duplicate an algorithm or code from
elsewhere, credit the original source with an inline comment.
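One possible shape for such an inline credit is sketched below. The angle-bracket fields are placeholders to be replaced with the real source you reused; the function itself is just a stand-in example and is not taken from any particular source.

```python
def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    # Credit: adapted from <author>, "<title>", <URL>, license: <license>.
    # (Placeholder fields: fill in the actual source and its license.)
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    print(chunked([1, 2, 3, 4, 5], 2))
```

Placing the comment next to the reused lines, rather than only in a README, keeps the attribution attached to the code if the file is copied elsewhere.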
Resources
Materials
This class assumes you are confident with this material, but in case you need a brush-up...
Other
Databases
- A MongoDB Schema Analyzer. One JavaScript file that you run with the mongo shell command on a database collection and it attempts to come up with a generalized schema of the datastore. It was also written about on the official MongoDB blog.
R and data analysis
- Modern Applied Statistics with S (4th Edition) by William N. Venables, Brian D. Ripley. ISBN 0387954570
- R
- Code School
- Quick-R
Tutorials written as ipython-notebooks
GitHub
- Git and GitHub
- GitHub Pages