blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License
4 stars 1 forks source link

Tune HDBSCAN parameters #36

Closed huddlej closed 1 year ago

huddlej commented 1 year ago

Tune HDBSCAN parameters by minimizing VI with training/test data from natural populations.

huddlej commented 1 year ago

Today I worked on the early and late SARS-CoV-2 datasets and came up with the following commands to define these datasets. We probably want to push timestamped versions of the original complete metadata and aligned sequences to a public S3 bucket, for posterity, or Zenodo, etc. Then we can recreate the analysis from scratch with the same inputs. Along those lines, these commands eventually need to be part of the workflow for the paper. We could probably define these commands in a data prep Snakefile and then define separate workflow files for early and late analyses like with flu.

The late dataset subsamples with a slight preference for recombinant lineages. I used awk to implement this logic below, but a Python implementation would be more robust and portable and readable, so that should come next.

# Download full open metadata.
curl -OL https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst

# Select only high-quality sequences based on Nextclade
# QC status and select columns of interest.
zstd -d -c metadata.tsv.zst \
    | tsv-filter -H --str-eq QC_overall_status:good \
    | tsv-select -H -f strain,genbank_accession_rev,date,region,country,originating_lab,submitting_lab,Nextstrain_clade,Nextclade_pango \
    | zstd -c > filtered_metadata.tsv.zst

# Remove full metadata.
rm -f metadata.tsv.zst

# Force include the reference strain for rooting.
echo "Wuhan-Hu-1/2019" > reference_strain.txt

# Subsample early samples.
augur filter --metadata filtered_metadata.tsv.zst --include reference_strain.txt --min-date 2020-01-01 --max-date 2022-01-01 --group-by region week --subsample-max-sequences 2000 --output-strains early_global_strains.txt

# Assign random priorities with slight preference for
# recombinant lineages.
zstd -c -d filtered_metadata.tsv.zst | tsv-select -H -f strain,Nextclade_pango | sed 1d | awk 'OFS="\t" { priority = rand(); if (substr($2, 1, 1) == "X") { print $1,sprintf("%.2f", priority + 0.25) } else { print $1,sprintf("%.2f", priority) }}' > priorities.tsv

# Subsample late samples, prioritizing recombinants.
augur filter --metadata filtered_metadata.tsv.zst --include reference_strain.txt --min-date 2022-01-01 --group-by region week --subsample-max-sequences 2000 --priority priorities.tsv --output-strains late_global_strains.txt

# Download aligned sequences.
curl -OL https://data.nextstrain.org/files/ncov/open/aligned.fasta.zst

# Extract metadata and sequences for early and late samples.
augur filter --metadata filtered_metadata.tsv.zst --sequences aligned.fasta.zst --exclude-all --include early_global_strains.txt --output-metadata early_global_metadata.tsv --output-sequences early_global_aligned.fasta

augur filter --metadata filtered_metadata.tsv.zst --sequences aligned.fasta.zst --exclude-all --include late_global_strains.txt --output-metadata late_global_metadata.tsv --output-sequences late_global_aligned.fasta