apetkau / comp7944-project

Project on visualizing association rules extracted from covid-19 data.

COMP 7944 Project

This project applies data mining techniques to COVID-19 data to extract association rules and visualizes these rules as a network. This repository includes code and supplementary materials for the project and is available online at https://github.com/apetkau/comp7944-project.

Authors

Prepared for COMP 7944 at the University of Manitoba on April 23, 2020 by:

Interactive visualizations

The interactive versions of our association rule visualizations (drawn as networks) are listed below. For each dataset we produced two networks: one where nodes are colored by confidence and one where nodes are colored by lift. You can zoom and pan with the mouse, and network nodes can be dragged and dropped.

  1. Symptoms rules network
    1. Symptoms rules (confidence)
    2. Symptoms rules (lift)
  2. Geographic date network
    1. Geographic date rules (confidence)
    2. Geographic date rules (lift)
  3. Geographic age network
    1. Geographic age rules (confidence)
    2. Geographic age rules (lift)
  4. SNV/Genomics network
    1. Single Nucleotide Variant rules (confidence)
    2. Single Nucleotide Variant rules (lift)
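As a reminder of what the two colorings measure, the following minimal Python sketch computes support, confidence, and lift using their standard definitions. The toy transactions and item names are invented for illustration and are not taken from the project data, and the helper functions are hypothetical rather than code from the project notebooks.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and C) / support(A): how often C appears when A does."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """confidence(A -> C) / support(C): values above 1 suggest a positive association."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical toy transactions (not project data).
transactions = [
    frozenset({"fever", "cough"}),
    frozenset({"fever", "cough", "fatigue"}),
    frozenset({"fever"}),
    frozenset({"cough"}),
]

conf = confidence(transactions, {"fever"}, {"cough"})
lft = lift(transactions, {"fever"}, {"cough"})
print(f"confidence={conf:.3f} lift={lft:.3f}")
```

Note that this sketch assumes the antecedent and consequent each occur in at least one transaction; otherwise the divisions fail.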

Data sources

Copies of the two datasets we are using can be found at:

  1. Epidemiological dataset (data) - (alternative link)
  2. Genomics (SNV) dataset (data) - (alternative link)

Defining transaction datasets for mining

We processed the above datasets to define sets of items (transactions) for use with the Apriori algorithm, which finds frequent itemsets and association rules. We constructed four separate transactional itemsets for further processing. Jupyter notebooks for processing this data are given below.

  1. Epidemiological dataset (code)
    1. Used to define the Symptoms, Geographic date, and Geographic age transactional itemsets for processing.
  2. Genomics/SNV dataset (code)
    1. Used to define the SNV/genomics dataset transactions.
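To make the idea of transactions and frequent itemsets concrete, here is a small self-contained Python sketch. It is not the code from the notebooks: the field names, records, and the `to_transaction` and `apriori` helpers are all hypothetical, and the Apriori implementation is a plain textbook version intended only to show the join, prune, and count steps.

```python
from itertools import combinations


def to_transaction(record):
    """Turn one record into a set of 'field=value' items, skipping missing values."""
    return frozenset(f"{k}={v}" for k, v in record.items() if v not in (None, ""))


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Count: measure the support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical records; field names and values are invented for illustration.
records = [
    {"country": "Canada", "age": "30-39", "symptom": "fever"},
    {"country": "Canada", "age": "30-39", "symptom": "cough"},
    {"country": "Canada", "age": "60-69", "symptom": "fever"},
    {"country": "China", "age": "30-39", "symptom": ""},
]
transactions = [to_transaction(r) for r in records]
frequent = apriori(transactions, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Encoding each field as a `field=value` item is one common way to turn tabular records into transactions; the notebooks may use a different encoding.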

Association rule mining and visualization

We next applied data mining techniques to find association rules in the above datasets and visualize the rules. Jupyter notebooks for this process are given below.

  1. Symptoms dataset (code)
  2. Geographic date dataset (code)
  3. Geographic age dataset (code)
  4. Genomics/SNV dataset (code)
    1. Phylogenetic tree construction (code)
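One way to picture how a set of rules becomes a network is sketched below in Python. This is an assumption about the general approach, modeled on a common rule-graph layout in which each rule and each item is a node, and it is not the code from the notebooks. The `rules_to_network` helper, the rule dictionaries, and the confidence and lift values are all made up for illustration.

```python
def rules_to_network(rules, color_by="confidence"):
    """Build node and edge lists: antecedent items -> rule node -> consequent items.

    Each rule node carries the measure named by `color_by` so a plotting
    library could map it to a color.
    """
    nodes = {}
    edges = []
    for i, rule in enumerate(rules):
        rule_id = f"rule{i}"
        nodes[rule_id] = {"kind": "rule", "value": rule[color_by]}
        for item in rule["antecedent"]:
            nodes.setdefault(item, {"kind": "item"})
            edges.append((item, rule_id))
        for item in rule["consequent"]:
            nodes.setdefault(item, {"kind": "item"})
            edges.append((rule_id, item))
    return nodes, edges


# Hypothetical rules with invented measure values (not project results).
rules = [
    {"antecedent": ["fever"], "consequent": ["cough"], "confidence": 0.8, "lift": 1.5},
    {"antecedent": ["fever", "cough"], "consequent": ["fatigue"], "confidence": 0.6, "lift": 2.0},
]

nodes, edges = rules_to_network(rules, color_by="lift")
print(len(nodes), "nodes,", len(edges), "edges")
```

Calling the function once with `color_by="confidence"` and once with `color_by="lift"` yields the same graph structure with different color values, which mirrors the two networks produced per dataset.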

Software

To reproduce this analysis you can use the following instructions to install dependencies using conda (though we note some additional R packages may need to be installed manually).

  1. Install Miniconda, which is used for software dependency management.

  2. Install dependencies (from dependencies.conda file) using the command:

    conda create --name datamining --file dependencies.conda
  3. Activate the conda environment with installed software:

    conda activate datamining
  4. Run JupyterLab:

    jupyter lab

You should now be able to load the Jupyter notebooks and work with them.

License

The source data for this project (under the data/ directory) is redistributed under the respective licenses of the original providers. The code in this project is distributed under the Apache 2.0 license.