Tune HDBSCAN parameters

Tune HDBSCAN parameters by minimizing the variation of information (VI) between HDBSCAN clusters and known clade or lineage labels, using training and test data from natural populations.
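
For reference, VI between two labelings of the same samples can be computed from their joint label counts. The following is a minimal sketch, not the project's implementation: the function name is hypothetical, natural logarithms are assumed, and every label (including HDBSCAN's noise label `-1`) is assumed to count as its own group.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI (in nats) between two labelings of the same samples.

    VI = H(A|B) + H(B|A). It is zero when both labelings define the same
    partition and grows as they disagree.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same samples.")

    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Each joint cell contributes to both conditional entropies.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))

    return vi
```

Because VI depends only on how samples are grouped, renaming cluster labels does not change the result, so HDBSCAN's integer labels can be compared directly against clade or lineage names.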

- [x] Define SARS-CoV-2 data for training and final analysis
  - [x] Subsample timestamped SARS-CoV-2 open data from an earlier time period (e.g., January 2020-June 2021)
  - [x] Subsample timestamped SARS-CoV-2 open data from a later time period (e.g., June 2021-June 2023) to include representation of recombinant lineages (10-20% recombinant lineages based on Nextclade Pango annotations with and without X prefix)
- [ ] Calculate VI per combination of relevant cluster parameters (distance threshold only; keep min samples and min cluster size fixed at biologically realistic values)
  - [x] Use 2016-2018 H3N2 HA data for training, 2018-2020 for test and final analysis
  - [x] Calculate VI per method for H3N2 HA/NA data using optimal parameters from 2016-2018 training data
  - [x] Use smallest distance threshold for a VI tie
  - [x] Use 2020-2021 SARS-CoV-2 data for training with Nextstrain clade
  - [x] Use 2020-2021 SARS-CoV-2 data for training with Nextclade pango (use these optimal parameters for testing)
  - [x] Determine how we want to "roll up" pango lineages (use Cornelius's tool)
  - [x] Use 2021-2023 for test and final analysis with Nextclade pango (roll up the same way we do for training data)
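
The threshold selection rule in the checklist (minimize VI, and take the smallest distance threshold on a tie) could be sketched as follows. The function name, the tolerance parameter, and the VI values in the usage lines are hypothetical and only illustrate the rule.

```python
def pick_distance_threshold(vi_by_threshold, tolerance=1e-9):
    """Return the smallest distance threshold whose VI ties the minimum.

    `vi_by_threshold` maps each candidate distance threshold to the VI
    observed on the training data. VI values within `tolerance` of the
    minimum are treated as tied (an assumption of this sketch).
    """
    if not vi_by_threshold:
        raise ValueError("No thresholds were evaluated.")

    best_vi = min(vi_by_threshold.values())
    tied = [
        threshold
        for threshold, vi in vi_by_threshold.items()
        if vi - best_vi <= tolerance
    ]
    return min(tied)


# Made-up VI values for illustration only; thresholds 1.0 and 2.0 tie.
example = {0.5: 0.30, 1.0: 0.20, 2.0: 0.20, 4.0: 0.25}
print(pick_distance_threshold(example))  # → 1.0
```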

Today I worked on the early and late SARS-CoV-2 datasets and came up with the following commands to define them. We should probably push timestamped versions of the original complete metadata and aligned sequences to a public S3 bucket or an archive like Zenodo, so we can recreate the analysis from scratch with the same inputs. Along those lines, these commands eventually need to be part of the workflow for the paper. We could define them in a data prep Snakefile and then define separate workflow files for the early and late analyses, as we do for flu.

The late dataset subsamples with a slight preference for recombinant lineages. I implemented this logic with awk below, but a Python implementation would be more robust, portable, and readable, so that should come next.
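
As a starting point, here is a minimal sketch of what that Python version might look like. It mirrors the awk logic (a uniform random priority plus 0.25 for lineages whose name starts with "X"). The function names and the `seed` parameter are assumptions of this sketch, and the fixed seed is an addition meant to make the priorities repeatable.

```python
import csv
import random


def assign_priorities(records, bonus=0.25, seed=0):
    """Yield (strain, priority) pairs with a bonus for recombinant lineages.

    `records` is an iterable of dicts with `strain` and `Nextclade_pango`
    keys. Lineages are treated as recombinant when their name starts with
    "X", as in the awk version.
    """
    rng = random.Random(seed)
    for record in records:
        priority = rng.random()
        lineage = record.get("Nextclade_pango") or ""
        if lineage.startswith("X"):
            priority += bonus
        yield record["strain"], round(priority, 2)


def write_priorities(metadata_handle, priorities_handle, **kwargs):
    """Read tab-delimited metadata and write headerless strain/priority rows."""
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    for strain, priority in assign_priorities(reader, **kwargs):
        priorities_handle.write(f"{strain}\t{priority:.2f}\n")
```

Called with `sys.stdin` and `sys.stdout`, `write_priorities` could replace the `tsv-select | sed | awk` steps, reading the decompressed metadata from `zstd -c -d` and writing `priorities.tsv`.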

# Download full open metadata.
curl -OL https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst

# Select only high-quality sequences based on Nextclade
# QC status and select columns of interest.
zstd -d -c metadata.tsv.zst \
    | tsv-filter -H --str-eq QC_overall_status:good \
    | tsv-select -H -f strain,genbank_accession_rev,date,region,country,originating_lab,submitting_lab,Nextstrain_clade,Nextclade_pango \
    | zstd -c > filtered_metadata.tsv.zst

# Remove full metadata.
rm -f metadata.tsv.zst

# Force include the reference strain for rooting.
echo "Wuhan-Hu-1/2019" > reference_strain.txt

# Subsample early samples.
augur filter \
    --metadata filtered_metadata.tsv.zst \
    --include reference_strain.txt \
    --min-date 2020-01-01 \
    --max-date 2022-01-01 \
    --group-by region week \
    --subsample-max-sequences 2000 \
    --output-strains early_global_strains.txt

# Assign random priorities with slight preference for
# recombinant lineages.
zstd -c -d filtered_metadata.tsv.zst \
    | tsv-select -H -f strain,Nextclade_pango \
    | sed 1d \
    | awk -F'\t' '
        # Split on tabs only, so strain names with spaces stay intact.
        BEGIN { OFS = "\t" }
        {
            priority = rand()
            # Lineages with an "X" prefix are recombinants.
            if (substr($2, 1, 1) == "X") {
                priority += 0.25
            }
            print $1, sprintf("%.2f", priority)
        }' > priorities.tsv

# Subsample late samples, prioritizing recombinants.
augur filter \
    --metadata filtered_metadata.tsv.zst \
    --include reference_strain.txt \
    --min-date 2022-01-01 \
    --group-by region week \
    --subsample-max-sequences 2000 \
    --priority priorities.tsv \
    --output-strains late_global_strains.txt

# Download aligned sequences.
curl -OL https://data.nextstrain.org/files/ncov/open/aligned.fasta.zst

# Extract metadata and sequences for early and late samples.
augur filter \
    --metadata filtered_metadata.tsv.zst \
    --sequences aligned.fasta.zst \
    --exclude-all \
    --include early_global_strains.txt \
    --output-metadata early_global_metadata.tsv \
    --output-sequences early_global_aligned.fasta

augur filter \
    --metadata filtered_metadata.tsv.zst \
    --sequences aligned.fasta.zst \
    --exclude-all \
    --include late_global_strains.txt \
    --output-metadata late_global_metadata.tsv \
    --output-sequences late_global_aligned.fasta
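
One way to check whether the late dataset reaches the 10-20% recombinant target is to count X-prefixed lineages in the extracted metadata. This is a rough sketch with a hypothetical function name. It only counts lineages whose name starts with "X", so it likely undercounts recombinant descendants whose aliased names lack that prefix.

```python
import csv


def recombinant_fraction(metadata_handle, lineage_column="Nextclade_pango"):
    """Return the fraction of records whose lineage name starts with "X".

    `metadata_handle` is any iterable of tab-delimited lines with a header,
    such as an open file handle for late_global_metadata.tsv.
    """
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    total = 0
    recombinant = 0
    for record in reader:
        total += 1
        lineage = record.get(lineage_column) or ""
        if lineage.startswith("X"):
            recombinant += 1

    return recombinant / total if total else 0.0


# Example usage against the file produced above:
#
#     with open("late_global_metadata.tsv", newline="") as handle:
#         print(f"{recombinant_fraction(handle):.1%}")
```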
