databridgevt / covid19

Data analysis of the 2020 COVID-19 pandemic
MIT License

Introductions + kaggle data download Fall 2020 #68

Open chendaniely opened 4 years ago

chendaniely commented 4 years ago

Hi all:

As you join the team and get acquainted with the project (and learn Git), please work through the two parts below.

Part 1: Introduce yourself to the repository

Please do the following to add your name to the "Teams" section of the README.md.

Follow the non-maintainer steps in: https://chendaniely.github.io/training_ds_r/help-faq.html#general-workflow

You should see your changes when you go to the repository page under the "Teams" section. This task also checks your understanding of Git and confirms that the settings on this repository are correct, so please let me know if you run into issues.

If you are having issues

Depending on when you cloned the repository, your push may be blocked for one of two reasons:

  1. Permissions (403 error): let me know your GitHub username so I can add you to this repository as a maintainer.
  2. The remote has changes you don't have: if you keep reading the error message, it is essentially telling you to pull first and then push again.
    • You may run into a merge conflict here, depending on which lines were changed. Just let me know if you end up with problems.
    • The tl;dr is that you need to open the README.md file, remove the <<<<<<<, =======, and >>>>>>> marker lines, and clean up the file until you're happy with it. Then add, commit, and push again.
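Before committing a resolved conflict, it can help to confirm that no marker lines are left in the file. The following is a minimal sketch, not part of the repository: `has_conflict_markers` is a hypothetical helper name, and it assumes Git's default seven-character markers.

```shell
# Hypothetical helper (not in this repository): list any leftover Git
# conflict-marker lines in a file. Exits 0 if markers are found, 1 if not.
# Assumes the default 7-character markers. A Markdown heading underlined
# with exactly seven "=" characters would be reported as a false positive.
has_conflict_markers() {
  grep -nE '^(<<<<<<<|>>>>>>>)( |$)|^=======$' "$1"
}
```

For example, `has_conflict_markers README.md` prints the offending lines with their line numbers. Once it prints nothing, the file is ready to add, commit, and push.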

Part 2: Download the kaggle dataset

Tasks:

  1. Make sure you have Python installed (Anaconda or Miniconda is preferred; otherwise you'll have to manage your own virtual environment).
    • If you haven't already done so, read about Python (conda) virtual environments here: https://daniel.rbind.io/2020/02/29/python-environments-with-conda/
    • Set up conda-forge as the default channel:
      # run this in your terminal (anaconda command prompt for windows)
      conda config --add channels conda-forge
      conda config --set channel_priority strict 
  2. Pull the latest updates from master. What you see on your computer should then match what is displayed on GitHub.
  3. Install the conda environment by navigating to this project in the terminal and running: conda env create -f environment.yml (if the environment already exists, conda env update -f environment.yml updates it instead)
    • The environment.yml includes make, so this should install make for you.
  4. Activate the environment with conda activate db_covid19
  5. Set up your Kaggle API credentials on your computer; directions here: https://github.com/Kaggle/kaggle-api#api-credentials
  6. Build the data with make data_kaggle. This downloads and unzips the Kaggle dataset; note that it is about 4.2 GB after extraction.

You should have all the kaggle files in the data/db/original/kaggle folder.
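Two quick checks can confirm the credential and download steps above. This is only a sketch under stated assumptions: both function names are hypothetical and not part of the repository, and the credentials check assumes the token file is named kaggle.json and lives in ~/.kaggle (or in the directory named by KAGGLE_CONFIG_DIR), as described in the Kaggle API directions linked in step 5. It does not cover credentials supplied through environment variables.

```shell
# Hypothetical pre-flight checks; neither function is part of this repository.

# Succeeds if a kaggle.json token file exists where the Kaggle CLI is
# assumed to look for it (KAGGLE_CONFIG_DIR if set, otherwise ~/.kaggle).
kaggle_credentials_present() {
  config_dir="${KAGGLE_CONFIG_DIR:-$HOME/.kaggle}"
  [ -f "$config_dir/kaggle.json" ]
}

# Succeeds if the data folder exists and contains at least one entry.
# Defaults to the folder named in this issue, relative to the project root.
kaggle_data_present() {
  data_dir="${1:-data/db/original/kaggle}"
  [ -d "$data_dir" ] && [ -n "$(ls -A "$data_dir" 2>/dev/null)" ]
}
```

Running `kaggle_credentials_present || echo "no kaggle.json found"` before make data_kaggle, and `kaggle_data_present || echo "data folder is empty"` afterwards from the project root, gives a fast sanity check without inspecting individual files.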

Optional, but will make your life easier later

If you install make outside the db_covid19 environment, you can do all the setup steps listed above by running make setup_env followed by make data_kaggle. Be aware that this deletes your db_covid19 environment and reinstalls all the packages in environment.yml from scratch.

  1. Go back to your base environment: conda activate base
  2. Install make there: conda install make
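The optional shortcut can be wrapped so that it fails early when make is missing. A minimal sketch, assuming it is run from the project root: `run_full_setup` is a hypothetical name, while setup_env and data_kaggle are the make targets named in this issue.

```shell
# Hypothetical wrapper (not in this repository) around the two make targets
# named above. Returns non-zero without running anything if make is missing.
run_full_setup() {
  if ! command -v make >/dev/null 2>&1; then
    echo "make not found; try: conda activate base, then conda install make" >&2
    return 1
  fi
  # Rebuild the environment first; only download the data if that succeeds.
  make setup_env && make data_kaggle
}
```

Because the environment is rebuilt from scratch, this is best reserved for a fresh setup or a deliberate reset.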

Copy of #1 #3