NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

proposal for Columbia data science capstone #387

Closed by damonmcc 9 months ago

damonmcc commented 9 months ago

due 12/11 or 12/8?

Columbia Data Science M.S. capstone projects

some favorite previous projects


from submission form

PROJECT DESCRIPTION Tell us more about your project.

Project Title

MOTIVATION, BACKGROUND AND OVERVIEW: Please state briefly what is the problem that the project tackles. The projects need to be focused on a data science problem that is engaging, relevant, clearly defined and of the right scope for a semester. When assessing the proposals we will be looking for a diverse set of problems that address different topics and technical requirements that our students can address. The evaluation criteria will include: Is this a data-science project? Can our students learn about a data science application in the real world? Is the proposed research problem important and can potentially have a big impact? Will our students be excited about it? Please provide your project description having these criteria in mind.

DATASETS The dataset(s) can be public or private. Please keep in mind that the students will need to list the project on their CV and the report will be public.

Please provide a sample of your dataset with at least 10 rows of tabular data, images, and/or metadata.

If you’re using private data, have you confirmed with relevant stakeholders (e.g. legal, compliance, communications) that the data can be used for this project? (If the answer is No, please contact Jessica Rodriguez at jr3056@columbia.edu when you submit this form).

DATASET: Please provide a detailed description of the type of data that is required to address the problem. For example, is this social media data, medical data, financial data, etc.? What is the size of the data? Will the organization provide the majority of the data, or is the data accessible via other avenues/sources? How much of the data is available? Do the students need to gather data? In assessing the projects, the availability and type of data will play an important role. Please consider these evaluation criteria for data requirements when submitting the proposal: Is the data set clearly defined? Is the data set complex and big enough for creating learning opportunities? Is the data set ready (availability, need for processing)? Does the data require extensive computing resources (if yes, can the affiliates provide resources/funding)?

DATA TYPE: Public data is data made available by a third party and is available to the general public. Novel data is data that has been recently published by the proposer or will be made public as part of this project. Private data is data that cannot be made available after the project ends. Please check all that apply: Uses Public Data; Uses Private Data; Uses Novel Data

HOW WILL THE DATASET BE MADE AVAILABLE? For example: CSV/XLS file, remote database, raw images or documents, REST endpoint, etc.

Type of Data: Graphs, Networks; Text Data; Audio/Image/Video; Geospatial; Time Series

Work Requirements (Check all that apply): R; Scraping (including API); Database (e.g. SQL); Preprocessing; Visualization; App/tool building

GOALS, OUTCOME and SKILLS Research Goals

Project Topic: Social Good; Biomedical; Physical Sciences (chemistry, climate, etc.); Consumer; Social Media; Finance and Economics

Data Science Areas in this project? Statistics; Causal Inference; Deep Learning; Reinforcement Learning; Algorithms

Expected Outcome? Model; Report; Paper; Software; Other

SKILLS: What skills should students expect to learn through their project? Check all that apply: Project planning and scoping; Data acquisition and scraping; Data versioning and management; Data cleaning; Combining data sources; Exploratory data analysis and visualization; Supervised modeling; Unsupervised modeling; Establishing evaluation metrics; Working with text data; Working with image data; Working with time series data; Working with tabular data; Working with geospatial data

What is the goal of this project? What questions do you want answered? What has been done already to achieve this goal?

What are the ethical considerations?

Are there any ethical concerns about the proposed project such as privacy, transparency, and bias that we should pay special attention to?

What is the relevant background needed for the project? In order to make sure we build the right team of students for each project, please describe the background that someone working on the project should have: the technical skills they should have, and any relevant literature (please provide citations) or tools (please provide links) they will need to know or be able to learn.

What are the quantitative and/or qualitative metrics that can be used to judge the successful completion of the capstone project?

Are international students on an F-1 or J-1 student visa eligible to work on this project?

Are you willing and/or able to work with students who are currently physically in another country (if time zone is not an issue)?

Are you willing to work with two teams of students?


from initial email

The Capstone Project course provides a unique opportunity for students in the M.S. in Data Science program to apply their knowledge of the foundations, theory, and methods of data science to address data-driven problems in industry, government, the nonprofit sector, or academia. Course activities focus on semester-long projects sponsored by our Industry Affiliates, NYC, or an academic research lab. Projects synthesize the statistical, computational, engineering, and social challenges involved in solving complex real-world problems. Typically, four or five students work together as a team on each project. Each team is supervised by a faculty mentor and/or an industry mentor, and projects typically progress through the following phases:

  1. Background and problem definition
  2. Data wrangling, munging and cleaning
  3. Exploratory Data Analysis
  4. Coding prototypes of algorithms and models
  5. Data Visualization
  6. Reporting, communicating and ethics discussion
  7. Productionizing any models or algorithms if applicable

sf-dcp commented 9 months ago

As a follow-up to the coding caucus meeting on 11/30: the GIS team has received a few internal and external questions about zoning district changes over time. It would be interesting to do a pilot project (perhaps for 2 points in time) to analyze changes in zoning codes and district areas. Damon gave an example of how cells are tracked through space in biology; we could research whether a similar approach would work here.
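
A rough sketch of what a two-snapshot comparison could look like, assuming each snapshot has already been reduced to (zoning code, area) pairs. The function names, the data layout, and the sample values are all hypothetical and not taken from any existing DCP dataset or pipeline. It only totals area per code and ignores where districts sit in space, so it does not cover the cell-tracking-style approach.

```python
from collections import defaultdict


def area_by_district(snapshot):
    """Sum area per zoning district code for one snapshot.

    `snapshot` is a list of (zoning_code, area) pairs; this layout is
    a hypothetical simplification, not a real DCP schema.
    """
    totals = defaultdict(float)
    for code, area in snapshot:
        totals[code] += area
    return dict(totals)


def area_changes(before, after):
    """Return {code: after_area - before_area} for codes in either snapshot."""
    old = area_by_district(before)
    new = area_by_district(after)
    return {
        code: new.get(code, 0.0) - old.get(code, 0.0)
        for code in sorted(set(old) | set(new))
    }


# Made-up illustrative values, not real zoning data.
snapshot_t1 = [("R6", 120.0), ("R6", 30.0), ("C4-4", 50.0)]
snapshot_t2 = [("R6", 100.0), ("C4-4", 50.0), ("M1-1", 50.0)]

changes = area_changes(snapshot_t1, snapshot_t2)
for code, delta in changes.items():
    print(f"{code}: {delta:+.1f}")
```

A code that appears in only one snapshot shows up as a full gain or loss, which is one simple way to flag newly mapped or retired districts.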

damonmcc commented 9 months ago

@mbh329 for any project ideas

damonmcc commented 9 months ago

team decided not to submit a proposal due to limited bandwidth and plans for internal data science work

mbh329 commented 9 months ago

I have to switch my notifications so that I actually get notified when you tag me!