
Proposal for gravitas: exploring probability distributions for bivariate temporal granularities

Student information

Name: Sayani Gupta

University: Monash University, Australia

Email: gupta.sayani@gmail.com

Student Email: Sayani.gupta@monash.edu

Github: https://github.com/Sayani07

Twitter: https://twitter.com/SayaniGupta07

Timezone: AEST (UTC+10:00)

Abstract

Project gravitas aims to provide methods to deconstruct time in an automated way. Deconstructions that respect the linear progression of time, such as days, weeks and months, are defined as linear time granularities, while those that accommodate periodicities in time, such as hour of the day or day of the month, are defined as circular granularities or calendar categorizations. Visualizing data across these circular granularities is often the natural way to explore periodicities, patterns or anomalies in the data. Moreover, because of the large volume of data collected nowadays, displaying probability distributions rather than raw observations is a potentially useful approach. The project will bring these techniques into the tidy workflow, so that probability distributions can be examined with the range of graphics available in the ggplot2 package.
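As a concrete illustration of that end goal, the sketch below shows how a distribution might be examined across one circular granularity with off-the-shelf ggplot2. The data set `elec` and its columns `reading_datetime` and `kwh` are hypothetical placeholders, not project data or project code.

```r
# Illustrative sketch only -- `elec`, `reading_datetime` and `kwh` are
# hypothetical smart-meter columns, not project data.
library(dplyr)
library(lubridate)
library(ggplot2)

elec %>%
  mutate(hour_day = hour(reading_datetime)) %>%  # a circular granularity
  ggplot(aes(x = factor(hour_day), y = kwh)) +
  geom_boxplot() +                               # one distribution per hour
  labs(x = "Hour of day", y = "Energy use (kWh)")
```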

Motivation

Considerable data is accumulated by sensors today. An example is data measuring energy usage on a fine scale using smart meters, which are now installed in many households across many countries. Providing tools to explore this type of data is an important activity. Probability distributions are induced by various aggregations of the data: by temporal components, by spatial region or by type of household. Visualizing the probability distribution of different households across different circular granularities will help energy retailers find similar households and understand consumer behavior better, and consequently plan ahead more efficiently.

Project gravitas

Project gravitas will consist of four parts:

  1. R Package: Develop an R package consisting of modules BUILD and COMPATIBLE

  2. Shiny UI: Develop a user interface using Shiny that lets users walk through the different modules of the package (a rough sketch follows this list)

  3. Application: Provide examples of probability visualization of smart meter data collected from Australian households

  4. Vignette: Document the R package functionality in a vignette
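A minimal sketch of what the Shiny walk-through could look like, assuming a hypothetical helper `plot_distribution()` that wires the BUILD output into a ggplot2 distribution plot; the granularity choices and layout shown are placeholders, not the final design.

```r
# Illustrative sketch only -- layout, choices and helpers are placeholders.
library(shiny)

ui <- fluidPage(
  titlePanel("gravitas: distributions across temporal granularities"),
  sidebarLayout(
    sidebarPanel(
      selectInput("gran1", "First granularity",
                  choices = c("hour_day", "hour_week", "week_month")),
      selectInput("gran2", "Second granularity",
                  choices = c("hour_day", "hour_week", "week_month"))
    ),
    mainPanel(plotOutput("dist_plot"))
  )
)

server <- function(input, output, session) {
  output$dist_plot <- renderPlot({
    # plot_distribution() is a hypothetical wrapper around the BUILD output
    # and a ggplot2 distribution layer (e.g. boxplots)
    plot_distribution(elec, input$gran1, input$gran2)
  })
}

shinyApp(ui, server)
```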

Module BUILD

The module BUILD will provide methods to construct an exhaustive set of granularities. Some of the functions to be created as part of this module are listed below:

| Function name | Description | Arg 1 | Arg 2 |
|---|---|---|---|
| ghour(.data, .grantype) | define granularities in combination with hour, e.g. hour of the week | a tsibble object | week |
| gqhour(.data, .grantype) | define granularities in combination with quarter-hour, e.g. quarter-hour of the day | a tsibble object | day |
| gweek(.data, .grantype) | define granularities in combination with week, e.g. week of the month | a tsibble object | month |

The input data is a tsibble, so that time indices are preserved as an essential data component. Each function will create granularities in combination with any time unit that is one or more levels coarser than the time index of the given tsibble.

The idea behind this module is to have an exhaustive set of granularities at our disposal, which the later module can use to uncover periodicities, patterns or anomalies in the data.
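To make the intended behaviour of the BUILD functions concrete, here is a rough sketch of how ghour() might compute an "hour of the day" or "hour of the week" column using lubridate. The column names and the .grantype options shown are illustrative assumptions, not the final gravitas API.

```r
# Illustrative sketch only -- not the final gravitas API.
# Assumes the tsibble has a date-time index.
library(tsibble)
library(lubridate)
library(dplyr)

ghour <- function(.data, .grantype = "week") {
  idx <- index(.data)  # symbol naming the tsibble's time index
  if (.grantype == "day") {
    mutate(.data, hour_day = hour(!!idx))                  # 0-23
  } else if (.grantype == "week") {
    mutate(.data,
           hour_week = hour(!!idx) +
             24 * (wday(!!idx, week_start = 1) - 1))       # 0-167
  } else {
    stop("unsupported .grantype: ", .grantype)
  }
}

# e.g. ghour(elec, "week") would add an `hour_week` column to the tsibble `elec`
```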

Module COMPATIBLE

The module COMPATIBLE will provide checks on the feasibility of plotting or drawing inference from two granularities together. The function will categorize pairs of granularities as either a harmony or a clash: harmonies are pairs of circular granularities that aid exploratory data analysis, while clashes are pairs that are incompatible with each other for exploratory analysis. This module will provide appropriate data structures for visualizing harmonies with the grammar of graphics.

Function name: compatible

Description: provide checks on the feasibility of plotting or drawing inference from bivariate temporal granularities

Usage: compatible(.data, build1, build2)

Arguments:

.data - A tsibble

build1, build2 - granularities computed in module BUILD through functions like gqhour, ghhour, ghour, gweek, gmonth, gquarter, gsemester, gyear. gqhour and ghour denote granularities obtained in combination with quarter of an hour and hour respectively. These can be numeric vectors or functions.

Value: a data structure specifying the number of observations per combination of the two builds and the variation in the number of observations across combinations, and declaring the pair as a "harmony" or "clash". These suggestions will serve as a guide for users looking to explore distributions of variables across different harmonies.
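To illustrate the kind of check COMPATIBLE could perform, below is a rough sketch under the assumption that build1 and build2 are granularity columns already added by the BUILD functions. The coefficient-of-variation threshold and the harmony/clash rule are placeholder assumptions, not the decided criteria.

```r
# Illustrative sketch only -- the actual harmony/clash criteria are to be
# worked out during the project.
library(dplyr)
library(tibble)

compatible <- function(.data, build1, build2, cv_threshold = 0.5) {
  # number of observations per combination of the two granularities
  tally <- .data %>%
    as_tibble() %>%
    count({{ build1 }}, {{ build2 }}, name = "n_obs")

  # structurally empty combinations and highly uneven cell counts both
  # suggest a clash (placeholder rule)
  n_cells <- n_distinct(pull(.data, {{ build1 }})) *
    n_distinct(pull(.data, {{ build2 }}))
  cv <- sd(tally$n_obs) / mean(tally$n_obs)

  verdict <- if (nrow(tally) < n_cells || cv > cv_threshold) "clash" else "harmony"
  list(counts = tally, cv = cv, verdict = verdict)
}

# e.g. compatible(elec_gran, hour_day, day_week)
```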

Related work

Brief Timeline

Detailed Project Timeline

Phase 0: Pre-GSoC Period (11 Apr - 6 May)

Weeks 1-3: Literature reading, and outlining the broad range of functions to be developed for each module. Compile a range of applications and data in addition to smart meters, and create small testing sets.

Phase 1: Community Bonding Period (7 May - 26 May)

Weeks 1-2: Lay out the road map for the project with the guidance of the mentors and establish modes of communication. Create and share a GitHub site for the project; set up issues and code structure, and invite mentors as collaborators. Add Travis CI to the repository to continuously check code as it is developed. Brainstorm ideas with mentors on the range of applications, the test data sets compiled, and the code structure to be developed.

Phase 2: Coding Period 1 (27 May - 23 June)

Week 1: Code simple granularity functions (BUILD). Create unit tests.
Week 2: Code remaining granularity functions (BUILD) and unit tests. Document with roxygen. Provide example code on test data sets.
Week 3: Code advice functions (COMPATIBLE) and unit tests. Document with roxygen. Provide example code on test data sets.
Week 4: Documentation in the form of a vignette on a broader range of data applications. CRAN tests completed.

Phase 3: Phase 1 Evaluations (24 June - 28 June)

Announce the package base to the social media community and request input and code review. Report on current progress at the project's wiki page after feedback from mentors.

Deliverables:

1. R package with granularity and advice functions
2. documentation with roxygen
3. unit tests
4. example code on test data
5. CRAN tests
6. progress report

Phase 4: Coding Period 2 (29 June - 22 July)

Week 1: Address issues and improvements from the community on functions (BUILD and COMPATIBLE) and make the package ready for submission to CRAN.
Weeks 2-3: Develop a Shiny UI to guide a user through the modules of the R package.

Phase 5: Phase 2 Evaluations (23 July - 26 July)

Create user documentation for the Shiny app. Announce the Shiny app to the social media community and request input, code and feature review. Report on current progress at the project's wiki page after feedback from mentors.

Deliverables:

1. address issues and improvements on the R package
2. Shiny UI
3. documentation for the Shiny app
4. progress report

Phase 6: Coding Period 3 (27 July - 18 August)

Week 1: Improve the Shiny UI features based on reviews and issues.
Week 2: Document the R package functionality in a vignette.
Week 3: Provide examples of probability visualization of smart meter data collected from Australian households.

Phase 7: Final week for submitting the product (19 August - 26 August)

Use as a buffer to wrap up and address remaining issues. Document ideas that resulted from discussions with mentors and the community but could not be implemented due to time limitations; these can form the basis of future work.

Phase 8: Final evaluation (27 August - 2 September)

Announce the vignette to the social media community. Write a blog post on the analysis of Australian smart meter data exemplifying the usefulness of the functions developed in this project. Report on all work done at the project's wiki page after feedback from mentors.

Deliverables:

1. improved Shiny UI features
2. R package vignette
3. application on Australian smart meter data
4. progress report

Additional information regarding timeline

Mentors

Brief bio

I'm a PhD student in the Department of Econometrics and Business Statistics at Monash University, Australia. My research interests are in data visualization, exploratory data analysis, time series analysis and computational statistics. I completed an M.Stat (Masters in Statistics) with a specialization in Quantitative Economics at the Indian Statistical Institute and a Bachelors (Honors) in Statistics at St. Xavier's College, University of Calcutta, after which I worked in the Analytics/Modelling teams at KPMG and PAYBACK. Although my training is in pure statistics, I have always been fascinated by the role of statistics in real-world applications. I love working with data and creating useful tools that can be deployed by users in different applications. One of the interesting projects I have implemented using R can be found here. I have therefore used R as a tool for implementing statistical techniques before; looking at it from the perspective of open source technology, however, has given it a totally different meaning. It made me realize that working on an open source project means improving the quality of the final product through transparent collaboration with others. It also taught me the importance of well-documented, easy-to-read code. Some of my open source contributions can be found here.