Migrate from GISAID to VirusSeq

ArtPoon commented 2 years ago

This is a suggestion I raised at the pillar 6 meeting. Now that the number of sequences deposited in the VirusSeq portal is close to the number in GISAID, it should be possible to base the notebook exclusively on the former. As far as I can tell, the rate limiting step is generating PANGO lineage assignments for the VirusSeq data. We have some scripts for doing that here: https://github.com/PoonLab/covizu/tree/vsportal and here https://gist.github.com/ArtPoon/50e26fa99a2e9390d2797dcef00e8b21

ArtPoon commented 2 years ago

Running retrieve-vsportal.py right now - original gzipped tar file was about 0.9Gb. The purpose of this script is to unpack the tar (two files) and write the FASTA file to an xz-compressed file, which is far more compact but time-consuming to compress.

ArtPoon commented 2 years ago

art@Wernstrom scripts % python3 virusseq.py 
[2022-03-16T15:14:02.960000] Caching download in memory
[2022-03-16T15:17:34.298280] Extracting file names in tar
[2022-03-16T15:17:45.593180] Exporting compressed files
[2022-03-16T15:33:16.597490] Wrote outputs to ./virusseq.2022-03-16T15:17:45.fasta.xz and ./virusseq.2022-03-16T15:17:45.metadata.tsv.gz

Compression step took about 15 minutes on a conventional desktop computer (iMac), using about 8GB of RAM. xz-compressed FASTA is 14Mb.

RaphaelRaphael commented 2 years ago

My concern is about the alignement, we decided to go with the alignement provided by Pangolin wich is not perfect but when comparing on a january release with msa from GISAID the issues where minors

ArtPoon commented 2 years ago

I'm not doing any alignment here

RaphaelRaphael commented 2 years ago

oh ok only metadata processing.

ArtPoon commented 2 years ago

uh, no... I'm going to be running the unaligned sequences through pangolin...

ArtPoon commented 2 years ago

Successful run of xz-compressed FASTA on 8 cores:

(pangolin) art@orolo:~/Downloads$ time mpirun -np 8 python3 mangolin.py virusseq.2022-03-16T15\:17\:45.fasta.xz
(3/8) processing 1000 sequences
...
(0/8) processing 714 sequences

real    25m14.995s
user    205m1.133s
sys 71m41.696s

ArtPoon commented 2 years ago

art@orolo:~/Downloads$ head -n1 mangolin.0.csv > combined.csv && tail -n+2 -q mangolin.*.csv >> combined.csv
art@orolo:~/Downloads$ wc -l combined.csv 
253706 combined.csv
art@orolo:~/Downloads$ head -n3 combined.csv 
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,version,pangolin_version,pangoLEARN_version,pango_version,status,note
hCoV-19/Canada/AB-ABPHL-30710/2021,AY.93,,,Delta (B.1.617.2-like),1.000000,0.000000,PANGO-v1.2.127,3.1.20,2022-02-28,v1.2.127,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0
hCoV-19/Canada/AB-ABPHL-30536/2021,AY.27,,,Delta (B.1.617.2-like),1.000000,0.000000,PANGO-v1.2.127,3.1.20,2022-02-28,v1.2.127,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0

ArtPoon commented 2 years ago

Pushed to branch iss11

ArtPoon commented 2 years ago

So far we have the metadata TSV from VirusSeq:

art@Wernstrom data_needed % gunzip -c virusseq.2022-03-16T15:17:45.metadata.tsv.gz | head -n2
study_id    specimen collector sample ID    sample collected by sequence submitted by   submission date sample collection date  sample collection date null reason  geo_loc_name (country)  geo_loc_name (state/province/territory) organism    isolatefasta header name    purpose of sampling purpose of sampling details anatomical material anatomical part body product    environmental material  environmental site  collection device   collection method   host (scientific name)  host disease    host age    host age null reason    host age unit   host age bin    host gender purpose of sequencing   purpose of sequencing details   sequencing instrument   sequencing protocol raw sequence data processing method dehosting method    consensus sequence software name    consensus sequence software version breadth of coverage value   depth of coverage value reference genome accession  bioinformatics protocol gene name   diagnostic pcr Ct value diagnostic pcr Ct value null reason GISAID accession
PHO-ON  ON_63835    Public Health Ontario (PHO) Public Health Ontario (PHO) 2021-11-03  2020-04-22      Canada  Ontario Severe acute respiratory syndrome coronavirus 2 hCoV-19/Canada/ON-PHL-20-00189/2020 hCoV-19/Canada/ON-PHL-20-00189/2020 Diagnostic testing  Not Provided    Not Provided    Nasopharynx (NP)    Not Provided    Not Provided    Not Provided    Swab    Missing Homo sapiens    COVID-1Restricted Access    year    80 - 89 Female  Missing Missing Illumina MiSeq  Viral sequencing was performed following a tiling amplicon strategy using the ARTIC v4 primer scheme. Sequencing was performed using an Illumina MiSeq sequencing instrument.   https://github.com/jts/ncov2019-artic-nf v1.7   BWA v0.7.17 iVar    1.3 99.5%   1338x   MN908947.3  https://github.com/jts/ncov2019-artic-nf v1.7   Not Provided        Not Provided    EPI_ISL_1230917

The combined outputs of mangolin.py:

art@Wernstrom data_needed % gunzip -c virusseq.2022-03-16T15:17:45.metadata.tsv.gz | head -n2
study_id    specimen collector sample ID    sample collected by sequence submitted by   submission date sample collection date  sample collection date null reason  geo_loc_name (country)  geo_loc_name (state/province/territory) organism    isolatefasta header name    purpose of sampling purpose of sampling details anatomical material anatomical part body product    environmental material  environmental site  collection device   collection method   host (scientific name)  host disease    host age    host age null reason    host age unit   host age bin    host gender purpose of sequencing   purpose of sequencing details   sequencing instrument   sequencing protocol raw sequence data processing method dehosting method    consensus sequence software name    consensus sequence software version breadth of coverage value   depth of coverage value reference genome accession  bioinformatics protocol gene name   diagnostic pcr Ct value diagnostic pcr Ct value null reason GISAID accession
PHO-ON  ON_63835    Public Health Ontario (PHO) Public Health Ontario (PHO) 2021-11-03  2020-04-22      Canada  Ontario Severe acute respiratory syndrome coronavirus 2 hCoV-19/Canada/ON-PHL-20-00189/2020 hCoV-19/Canada/ON-PHL-20-00189/2020 Diagnostic testing  Not Provided    Not Provided    Nasopharynx (NP)    Not Provided    Not Provided    Not Provided    Swab    Missing Homo sapiens    COVID-1Restricted Access    year    80 - 89 Female  Missing Missing Illumina MiSeq  Viral sequencing was performed following a tiling amplicon strategy using the ARTIC v4 primer scheme. Sequencing was performed using an Illumina MiSeq sequencing instrument.   https://github.com/jts/ncov2019-artic-nf v1.7   BWA v0.7.17 iVar    1.3 99.5%   1338x   MN908947.3  https://github.com/jts/ncov2019-artic-nf v1.7   Not Provided        Not Provided    EPI_ISL_1230917
art@Wernstrom data_needed % head -n2 combined.csv
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,version,pangolin_version,pangoLEARN_version,pango_version,status,note
hCoV-19/Canada/AB-ABPHL-30710/2021,AY.93,,,Delta (B.1.617.2-like),1.000000,0.000000,PANGO-v1.2.127,3.1.20,2022-02-28,v1.2.127,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0

and we want to reproduce the zipped metadata file that originates from GISAID and is currently stored in data_needed:

art@Wernstrom data_needed % unzip -c metadata_CANall_last.zip | head -n4
Archive:  metadata_CANall_last.zip
  inflating: metadata_CANall_2022_03_08.csv  
Virus_name  Type    Accession_ID    Collection_date Location    Additional_location_information Sequence_length Host    Patient_age Gender  Clade   Pango_lineage   Pangolin_version    Variant AA_Substitutions    Submission_date Is_reference?   Is_complete?    Is_high_coverage?   Is_low_coverage?    N-Content   GC-Content
hCoV-19/Canada/QC-L00254708/2020    betacoronavirus EPI_ISL_1001069 2020-04-27  North_America_/_Canada_/_Quebec NA  29782   Human   28  Male    G   B.1.1282022-02-28   NA  (M_D3G,NSP14_I150T,NSP12_P323L,Spike_D614G) 2021-02-15  NA  True    True    NA  NA  0.380196091599

ArtPoon commented 2 years ago

It was easier to just combine the VirusSeq metadata and Pangolin output, added a new script for this (pango2vseq.py)

ArtPoon commented 2 years ago

remaining tasks:

[x] province-specific barplots
[x] downsampling of records based on some criteria @RaphaelRaphael ?
[x] extraction and pairwise alignment of genome sequences in sample
[x] relocate streamgraph code into another issue
[x] use new alignment to reconstruct phylogenies
[ ] mutational composition of Omicron
[ ] selection on Omicron sublineages
[x] root-to-tip regression
[ ] update acknowledgements

ArtPoon commented 2 years ago

@RaphaelRaphael - would you be able to take over with the tree reconstruction? I can walk you through the steps you need to get up to making the alignment

RaphaelRaphael commented 2 years ago

Yes I will have 2 hours 9 to 11 east time tomorrow morning: note that seed of pseudo random generator have to change in order to have 3 different subsampling

ArtPoon commented 2 years ago

Yeah I added an option to set the random seed with --seed

RaphaelRaphael commented 2 years ago

About metadata from GISAID that we want to reproduce Ithink there isless than 10 fields that are really used in the script

ArtPoon commented 2 years ago

@RaphaelRaphael - If you could let me know which fields in #17 that would be a big help, thanks!

ArtPoon commented 2 years ago

generated 3 replicate sample FASTAs

running IQ-TREE2 on workstation

Host:    orolo (AVX2, FMA3, 31 GB RAM)
Command: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s temp.fa -m GTR -ninit 10 -n 4
Seed:    714927 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Thu Mar 31 12:34:22 2022
Kernel:  AVX+FMA - 8 threads (32 CPU cores detected)

not sure why -n flag is being used twice as per this protocol, wouldn't one just overwrite other?

ArtPoon commented 2 years ago

Ok that took ~~less than half an hour~~ about an hour:

BEST SCORE FOUND : -317330.796
Total tree length: 0.929

Total number of iterations: 4
CPU time used for tree search: 9584.913 sec (2h:39m:44s)
Wall-clock time used for tree search: 1398.992 sec (0h:23m:18s)
Total CPU time used: 27739.470 sec (7h:42m:19s)
Total wall-clock time used: 3801.592 sec (1h:3m:21s)

ArtPoon commented 2 years ago

Spun up the next two trees, looks like about 10% RAM each (32GB on my workstation), 50% CPU

ArtPoon commented 2 years ago

Oof - RAM usage jumped to about 30% each.

ArtPoon commented 2 years ago

Crap, I maxed the RAM and hitting the swap, can't log in.

ArtPoon commented 2 years ago

Both IQ-TREE processes completed while workstation was frozen. Ran root2tip.R script on first tree:

art@orolo:~/git/duotang$ Rscript scripts/root2tip.R working/sample1.iqtree.nwk "_" "-1" > working/sample1.rtt.nwk
Loading required package: ape
Loading required package: lubridate

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date
art@orolo:~/git/duotang/working$ ls -lt sample1*
-rw-rw-r-- 1 art art 587656 Mar 31 18:52 sample1.rtt.nwk
-rw-r--r-- 1 art art 657696 Mar 31 18:21 sample1.iqtree.nwk

took about half an hour for one tree - see if tempest is faster

ArtPoon commented 2 years ago

For sample1 - definitely some outliers

ArtPoon commented 2 years ago

setwd("~/git/duotang/working")
require(ape)
rtt <- read.tree("sample1.rtt.nwk")
tip.dates <- as.Date(sapply(rtt$tip.label, function(x) {
  tokens <- strsplit(x, "_")[[1]]
  tokens[length(tokens)]
  }))
div <- node.depth.edgelength(rtt)[1:Ntip(rtt)]
par(mar=c(5,5,1,1))
plot(tip.dates, div, pch=19, cex=0.5, col=rgb(0.5,0,0,0.2), bty='n')
fit <- lm(div~tip.dates)
abline(fit, lwd=2)
#plot(residuals(fit), cex=0.5)
out <- summary(fit)
stderr <- out$coefficients[1,2]
abline(a=fit$coef[1]+3*stderr, b=fit$coef[2], lty=2)
abline(a=fit$coef[1]-3*stderr, b=fit$coef[2], lty=2)

fionabrinkman commented 2 years ago

Excellent! So, what's up with that set of outliers all at the same time around last November, if I'm reading this right? I guess they're associated with a big data dump?

ArtPoon commented 2 years ago

Good question! Looks like those points are largely from Quebec:

> identify(tip.dates, residuals(fit))
[1] 4669 9001
> rtt$tip.label[4669]
[1] "hCoV-19/Canada/QC-L00413201001/2021_B.1.351_2021-11-24"
> rtt$tip.label[9001]
[1] "hCoV-19/Canada/QC-L00413123001/2021_A.1_2021-11-24"
> identify(tip.dates, residuals(fit))
[1] 4541 9000
> rtt$tip.label[9000]
[1] "hCoV-19/Canada/QC-L00413118001/2021_A.1_2021-11-24"
> rtt$tip.label[4541]  # this one is higher
[1] "hCoV-19/Canada/ON-PHL-21-44225/2021_B.1_2021-12-01"

RaphaelRaphael commented 2 years ago

It happen 2 weeks ago to map a sample in november that should not have been in november and it appears thay had a database problem in november ... I just wrote them about that

ArtPoon commented 2 years ago

added an RTT filtering script to handle outliers, generates pruned ML tree and name/date TSV for treetime
ran all three samples through filter and treetime
added a script to convert treetime NEXUS format into Newick strings (takes a couple of minutes - FigTree is faster but this cmd-line script can be automated)

For example:

Rscript scripts/filter-rtt.R data_needed/sample2.iqtree.nwk data_needed/sample2.filtered.nwk data_needed/sample2.dates.csv
treetime --tree data_needed/sample2.filtered.nwk --dates data_needed/sample2.dates.tsv --clock-filter 0 --sequence-length 29903
python3 scripts/nex2nwk.py data_needed/sample2.nexus data_needed/sample2.timetree.nwk

ArtPoon commented 2 years ago

Closing this issue, will make new issues for remaining tasks

CoVaRR-NET / duotang

Migrate from GISAID to VirusSeq #11