This repository contains code for the birdclef-2022 Kaggle competition for the Data Science at Georgia Tech team.
Development has primarily been done on Windows 10, but the code is generally platform agnostic and runs on the default Kaggle kernel.
Check out the repository to your local machine and download the data from the
competition website. Ensure the data is extracted into the
`data/raw/birdclef-2022` directory.
```bash
git clone https://github.com/acmiyaguchi/birdclef-2022
cd birdclef-2022

# download the data to the data/raw directory and extract
mkdir -p data/raw
# ...

# ensure that you can run the following command from the project root
cat data/raw/birdclef-2022/scored_birds.json | wc -l
# 23
```
Install the Google Cloud SDK and request access to the `birdclef-2022`
bucket. Run the following command to verify that you have the correct
permissions.
```bash
gsutil cat gs://birdclef-2022/processed/model/2022-04-12-v4/metadata.json
```

```json
{
  "embedding_source": "data/intermediate/embedding/tile2vec-v2/version_2/checkpoints/epoch=2-step=10849.ckpt",
  "embedding_dim": 64,
  "created": "2022-04-12T23:09:51.920185",
  "cens_sr": 10,
  "mp_window": 20,
  "use_ref_motif": false
}
```
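The metadata fields above can also be read programmatically. A minimal sketch; the JSON literal below copies the example output shown above rather than fetching it from the bucket:

```python
import json

# Example metadata copied from the gsutil output above; in practice this
# would be read from the downloaded metadata.json file.
metadata_text = """
{
  "embedding_source": "data/intermediate/embedding/tile2vec-v2/version_2/checkpoints/epoch=2-step=10849.ckpt",
  "embedding_dim": 64,
  "created": "2022-04-12T23:09:51.920185",
  "cens_sr": 10,
  "mp_window": 20,
  "use_ref_motif": false
}
"""

metadata = json.loads(metadata_text)
print(metadata["embedding_dim"])  # dimensionality of the tile2vec embedding
print(metadata["mp_window"])      # matrix profile window size
```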
Run the `scripts/sync.py` script to pull data down from the remote bucket.

```bash
python scripts/sync.py down
```

In particular, this will synchronize shared files from the `data/processed`
directory.
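The core of such a script is a wrapper around `gsutil rsync`. A rough sketch of the idea; the actual `scripts/sync.py` may differ, and `run=False` here only prints the command for illustration:

```python
import subprocess

# Assumed bucket and shared directory, matching the layout described above.
BUCKET = "gs://birdclef-2022"
SHARED_DIRS = ["data/processed"]


def sync_command(direction, path):
    """Build a gsutil rsync command for pulling ("down") or pushing ("up")."""
    remote = f"{BUCKET}/{path.split('/', 1)[1]}"  # strip the leading "data/"
    if direction == "down":
        return ["gsutil", "-m", "rsync", "-r", remote, path]
    return ["gsutil", "-m", "rsync", "-r", path, remote]


def sync(direction, run=False):
    for path in SHARED_DIRS:
        cmd = sync_command(direction, path)
        print(" ".join(cmd))
        if run:
            subprocess.run(cmd, check=True)
```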
Install Python 3.7 or above. Install pipx to manage a few utilities like
`pip-tools` and `pre-commit`.

```bash
pip install pipx
pipx install pip-tools
pipx install pre-commit
```
Install the pre-commit hooks. This ensures that all the code is formatted
correctly.

```bash
pre-commit install
```
Create a new virtual environment and activate it.

```bash
# create a virtual environment in the venv/ directory
python -m venv venv

# activate on Windows
./venv/Scripts/Activate.ps1

# activate on Linux/MacOS
source venv/bin/activate
```
Then install all of the dependencies.

```bash
pip install -r requirements.txt
```
Unit-testing helps with debugging smaller modules in a larger project. For example, we use tests to assert that models accept data in one shape and output predictions in another shape. We use pytest in this project. Running the tests can help ensure that your environment is configured correctly.
```bash
pytest -vv tests/
```
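A shape-style test of the kind described above might look like this. The model and dimensions here are made up for illustration; see `tests/` for the real cases:

```python
# Hypothetical shape-checking unit test; embed() stands in for a real model
# that maps audio windows to fixed-size embedding vectors.
def embed(batch, dim=64):
    """Dummy embedding: map each audio window to a dim-length vector."""
    return [[0.0] * dim for _ in batch]


def test_embed_output_shape():
    batch = [[0.1] * 16000 for _ in range(4)]  # 4 one-second clips at 16 kHz
    embeddings = embed(batch)
    assert len(embeddings) == 4                       # one vector per clip
    assert all(len(vec) == 64 for vec in embeddings)  # fixed embedding size
```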
You can select a subset of tests using the `-k` flag.

```bash
pytest -vv tests/ -k embed_tilenet
```
You can also exit tests early using the `-x` flag and drop into a debugger on
failing tests using the `--pdb` flag.
The repository is structured in the following way.
| Directory | Description |
|---|---|
| `birdclef` | The primary Python module that encapsulates all the competition code. |
| `data` | Associated data files, not checked into the repository. |
| `notebooks` | Notebooks, often for exploration and analysis. The naming convention is `YYYY-MM-DD-{initials}-{notebook name}.ipynb`. |
| `figures` | Figures that are checked into the repository. |
| `notes` | Notes about the project. Filenames should be prefixed by GitHub handle. |
| `scripts` | Scripts for maintaining the development environment and other miscellaneous tasks. |
| `tests` | Unit tests written in pytest. |
| `terraform` | Terraform configuration files for associated cloud resources. |
| `label-studio` | Label Studio configuration files (may be deprecated). |
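The notebook naming convention in the table above is easy to check mechanically. A small sketch; the regex is one interpretation of the convention, not a script that exists in this repository:

```python
import re

# YYYY-MM-DD-{initials}-{notebook name}.ipynb, per the convention above.
NOTEBOOK_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}-[a-z]+-[\w-]+\.ipynb$")


def valid_notebook_name(filename):
    """Return True if the filename follows the notebook naming convention."""
    return NOTEBOOK_NAME.match(filename) is not None


print(valid_notebook_name("2022-04-12-acm-embedding-explore.ipynb"))  # True
print(valid_notebook_name("embedding-explore.ipynb"))                 # False
```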
The `birdclef` Python module has a few notable submodules.
| Directory | Description |
|---|---|
| `datasets` | This contains code related to the soundscape task. |
| `models` | This contains code related to different models used throughout the project. |
| `workflows` | This contains code related to the workflows, such as the command-line interface. |
The data directory has three notable subdirectories.
| Directory | Description |
|---|---|
| `data/raw` | Raw data files, which are provided by the competition. |
| `data/intermediate` | Intermediate data files, generated by tasks in the repository and generally not shared. |
| `data/processed` | Processed data files, which are shared across the team and into the Kaggle notebooks. |
The majority of development notes can be found under the `notes` directory.
This repository uses `pip-compile` to maintain dependencies. Please add direct
dependencies to `requirements.in` rather than modifying `requirements.txt`.
After adding a dependency, run `pip-compile` to generate a new
`requirements.txt` file. The sequence looks something like:

```bash
pipx install pip-tools  # if you haven't installed it already via the quickstart guide

# add any new direct dependencies to requirements.in
pip-compile

# observe that requirements.txt has changed locally
# commit the result
```
See the following two notebooks:

The first notebook downloads any models in the shared GCP bucket
(`gs://birdclef-2022`). It also downloads the main package in this repository
using a private GitHub token.

The second notebook contains the actual code. It simply mounts the output of
the model-sync notebook and calls the `birdclef.workflows.classify` command.
The approach for this year's competition is focused on unsupervised methods. In particular, the fast similarity matrix profile and tile2vec papers provide the technical foundation for methods found in the repository.
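As a rough intuition for the matrix profile: for each fixed-length window of a time series, it records the distance to that window's nearest neighbor elsewhere in the series, so low values mark repeated motifs. A naive pure-Python sketch for illustration only; it is quadratic, unlike the fast algorithms the paper describes, and is not the code used in this repository:

```python
import math


def matrix_profile(series, m):
    """Naive matrix profile: distance from each length-m window to its
    nearest non-overlapping neighbor. O(n^2 * m); real implementations
    (e.g. STOMP/SCRIMP-style algorithms) are far faster."""
    n = len(series) - m + 1
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < m:  # skip trivial (overlapping) matches
                continue
            window_i = series[i:i + m]
            window_j = series[j:j + m]
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window_i, window_j)))
            best = min(best, dist)
        profile.append(best)
    return profile


# The repeated motif (1, 2, 3) yields profile values of zero at its occurrences.
print(matrix_profile([1, 2, 3, 9, 1, 2, 3, 9], m=3))
```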