ArtPoon closed this issue 2 years ago.
Running retrieve-vsportal.py right now - the original gzipped tar file was about 0.9 GB. The purpose of this script is to unpack the tar (two files) and write the FASTA file to an xz-compressed file, which is far more compact but time-consuming to compress.
art@Wernstrom scripts % python3 virusseq.py
[2022-03-16T15:14:02.960000] Caching download in memory
[2022-03-16T15:17:34.298280] Extracting file names in tar
[2022-03-16T15:17:45.593180] Exporting compressed files
[2022-03-16T15:33:16.597490] Wrote outputs to ./virusseq.2022-03-16T15:17:45.fasta.xz and ./virusseq.2022-03-16T15:17:45.metadata.tsv.gz
The compression step took about 15 minutes on a conventional desktop computer (iMac), using about 8 GB of RAM. The xz-compressed FASTA is 14 MB.
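For reference, a minimal sketch of that unpack-and-recompress step in Python; file names here are placeholders and the actual script may differ in detail:

import tarfile
import lzma
import gzip
import shutil
from datetime import datetime

stamp = datetime.now().isoformat(timespec='seconds')

# the gzipped tar from the VirusSeq portal holds two files: a FASTA and a metadata TSV
with tarfile.open("virusseq-data-export.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        handle = tar.extractfile(member)
        if handle is None:
            continue
        if member.name.endswith(".fasta"):
            # xz (LZMA) is far more compact than gzip, but much slower to write
            with lzma.open(f"virusseq.{stamp}.fasta.xz", "wb") as out:
                shutil.copyfileobj(handle, out)
        elif member.name.endswith(".tsv"):
            with gzip.open(f"virusseq.{stamp}.metadata.tsv.gz", "wb") as out:
                shutil.copyfileobj(handle, out)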
My concern is about the alignment; we decided to go with the alignment provided by Pangolin, which is not perfect, but when comparing a January release against the MSA from GISAID the issues were minor.
I'm not doing any alignment here
oh ok only metadata processing.
uh, no... I'm going to be running the unaligned sequences through pangolin
...
Successful run on the xz-compressed FASTA using 8 cores:
(pangolin) art@orolo:~/Downloads$ time mpirun -np 8 python3 mangolin.py virusseq.2022-03-16T15\:17\:45.fasta.xz
(3/8) processing 1000 sequences
...
(0/8) processing 714 sequences
real 25m14.995s
user 205m1.133s
sys 71m41.696s
art@orolo:~/Downloads$ head -n1 mangolin.0.csv > combined.csv && tail -n+2 -q mangolin.*.csv >> combined.csv
art@orolo:~/Downloads$ wc -l combined.csv
253706 combined.csv
art@orolo:~/Downloads$ head -n3 combined.csv
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,version,pangolin_version,pangoLEARN_version,pango_version,status,note
hCoV-19/Canada/AB-ABPHL-30710/2021,AY.93,,,Delta (B.1.617.2-like),1.000000,0.000000,PANGO-v1.2.127,3.1.20,2022-02-28,v1.2.127,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0
hCoV-19/Canada/AB-ABPHL-30536/2021,AY.27,,,Delta (B.1.617.2-like),1.000000,0.000000,PANGO-v1.2.127,3.1.20,2022-02-28,v1.2.127,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0
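For context, mangolin.py is essentially an MPI wrapper around pangolin: each rank takes a slice of the FASTA, runs pangolin on it, and writes its own mangolin.<rank>.csv, which is why the per-rank outputs get merged with the head/tail one-liner above. A rough sketch of that chunking logic (the real script appears to process batches of 1000 sequences and may differ in detail):

from mpi4py import MPI
import lzma
import subprocess

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

def read_fasta(path):
    """Stream (header, sequence) tuples from an xz-compressed FASTA."""
    with lzma.open(path, "rt") as handle:
        header, seq = None, []
        for line in handle:
            if line.startswith(">"):
                if header:
                    yield header, "".join(seq)
                header, seq = line.strip(), []
            else:
                seq.append(line.strip())
        if header:
            yield header, "".join(seq)

# every rank takes every nprocs-th record (interleaved split)
chunk = [rec for i, rec in enumerate(read_fasta("virusseq.fasta.xz")) if i % nprocs == rank]

with open(f"chunk.{rank}.fasta", "w") as out:
    for header, seq in chunk:
        out.write(f"{header}\n{seq}\n")

# run pangolin on this rank's chunk, writing mangolin.<rank>.csv
subprocess.run(["pangolin", f"chunk.{rank}.fasta", "--outfile", f"mangolin.{rank}.csv"], check=True)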
Pushed to branch iss11
So far we have the metadata TSV from VirusSeq:
art@Wernstrom data_needed % gunzip -c virusseq.2022-03-16T15:17:45.metadata.tsv.gz | head -n2
study_id specimen collector sample ID sample collected by sequence submitted by submission date sample collection date sample collection date null reason geo_loc_name (country) geo_loc_name (state/province/territory) organism isolate fasta header name purpose of sampling purpose of sampling details anatomical material anatomical part body product environmental material environmental site collection device collection method host (scientific name) host disease host age host age null reason host age unit host age bin host gender purpose of sequencing purpose of sequencing details sequencing instrument sequencing protocol raw sequence data processing method dehosting method consensus sequence software name consensus sequence software version breadth of coverage value depth of coverage value reference genome accession bioinformatics protocol gene name diagnostic pcr Ct value diagnostic pcr Ct value null reason GISAID accession
PHO-ON ON_63835 Public Health Ontario (PHO) Public Health Ontario (PHO) 2021-11-03 2020-04-22 Canada Ontario Severe acute respiratory syndrome coronavirus 2 hCoV-19/Canada/ON-PHL-20-00189/2020 hCoV-19/Canada/ON-PHL-20-00189/2020 Diagnostic testing Not Provided Not Provided Nasopharynx (NP) Not Provided Not Provided Not Provided Swab Missing Homo sapiens COVID-19 Restricted Access year 80 - 89 Female Missing Missing Illumina MiSeq Viral sequencing was performed following a tiling amplicon strategy using the ARTIC v4 primer scheme. Sequencing was performed using an Illumina MiSeq sequencing instrument. https://github.com/jts/ncov2019-artic-nf v1.7 BWA v0.7.17 iVar 1.3 99.5% 1338x MN908947.3 https://github.com/jts/ncov2019-artic-nf v1.7 Not Provided Not Provided EPI_ISL_1230917
The combined outputs of mangolin.py:
art@Wernstrom data_needed % head -n2 combined.csv
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,version,pangolin_version,pangoLEARN_version,pango_version,status,note
hCoV-19/Canada/AB-ABPHL-30710/2021,AY.93,,,Delta (B.1.617.2-like),1.000000,0.000000,PANGO-v1.2.127,3.1.20,2022-02-28,v1.2.127,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0
and we want to reproduce the zipped metadata file that originates from GISAID and is currently stored in data_needed:
art@Wernstrom data_needed % unzip -c metadata_CANall_last.zip | head -n4
Archive: metadata_CANall_last.zip
inflating: metadata_CANall_2022_03_08.csv
Virus_name Type Accession_ID Collection_date Location Additional_location_information Sequence_length Host Patient_age Gender Clade Pango_lineage Pangolin_version Variant AA_Substitutions Submission_date Is_reference? Is_complete? Is_high_coverage? Is_low_coverage? N-Content GC-Content
hCoV-19/Canada/QC-L00254708/2020 betacoronavirus EPI_ISL_1001069 2020-04-27 North_America_/_Canada_/_Quebec NA 29782 Human 28 Male G B.1.128 2022-02-28 NA (M_D3G,NSP14_I150T,NSP12_P323L,Spike_D614G) 2021-02-15 NA True True NA NA 0.380196091599
It was easier to just combine the VirusSeq metadata and Pangolin output; I added a new script for this (pango2vseq.py).
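The join itself is straightforward; a sketch of what pango2vseq.py presumably does, here with pandas, assuming the VirusSeq "fasta header name" column matches the pangolin "taxon" column (which it does in the excerpts above; the output file name is a placeholder):

import pandas as pd

# VirusSeq metadata (gzipped TSV) and the combined pangolin output (CSV)
meta = pd.read_csv("virusseq.2022-03-16T15:17:45.metadata.tsv.gz", sep="\t")
pango = pd.read_csv("combined.csv")

# left join so every VirusSeq record keeps its row even if pangolin failed on it
merged = meta.merge(pango, left_on="fasta header name", right_on="taxon", how="left")
merged.to_csv("virusseq.pango.tsv.gz", sep="\t", index=False)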
remaining tasks:
@RaphaelRaphael - would you be able to take over with the tree reconstruction? I can walk you through the steps you need to get up to making the alignment
Yes, I will have two hours, 9 to 11 Eastern time, tomorrow morning. Note that the seed of the pseudo-random number generator has to change in order to get three different subsamples.
Yeah I added an option to set the random seed with --seed
About the metadata from GISAID that we want to reproduce: I think there are fewer than 10 fields that are really used in the script.
@RaphaelRaphael - If you could let me know which fields in #17, that would be a big help, thanks!
generated 3 replicate sample FASTAs
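For the record, a minimal sketch of seeded subsampling so the three replicates are reproducible but distinct (Biopython is used here for brevity; the sample size and paths are placeholders, and the actual script exposes the seed via --seed):

import random
import lzma
from Bio import SeqIO

def subsample_fasta(inpath, outpath, n, seed):
    """Write a reproducible random subsample of n records from an xz-compressed FASTA."""
    rng = random.Random(seed)  # independent generator per replicate
    with lzma.open(inpath, "rt") as handle:
        records = list(SeqIO.parse(handle, "fasta"))
    SeqIO.write(rng.sample(records, n), outpath, "fasta")

# three replicates, each with its own seed
for i in (1, 2, 3):
    subsample_fasta("virusseq.fasta.xz", f"sample{i}.fasta", n=5000, seed=i)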
running IQ-TREE2 on workstation
Host: orolo (AVX2, FMA3, 31 GB RAM)
Command: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s temp.fa -m GTR -ninit 10 -n 4
Seed: 714927 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Thu Mar 31 12:34:22 2022
Kernel: AVX+FMA - 8 threads (32 CPU cores detected)
Not sure why the -n flag is being used twice as per this protocol; wouldn't one just overwrite the other?
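For reference, a sketch of the same invocation with each option given once, keeping the later values (-ninit 10, -n 4) and fixing the seed for reproducibility; this is my reading of the protocol, not the protocol itself:

import subprocess

# same IQ-TREE run with each option given once; the later values from the
# logged command (-ninit 10, -n 4) are assumed to be the intended ones
cmd = [
    "iqtree2",
    "-s", "temp.fa",      # input alignment
    "-m", "GTR",          # substitution model
    "-nt", "8",           # number of threads
    "-ninit", "10",       # number of initial parsimony trees
    "-n", "4",            # number of tree-search iterations
    "-me", "0.05",        # log-likelihood epsilon for convergence
    "-seed", "714927",    # fix the RNG seed so the run is reproducible
]
subprocess.run(cmd, check=True)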
OK, that took about an hour:
BEST SCORE FOUND : -317330.796
Total tree length: 0.929
Total number of iterations: 4
CPU time used for tree search: 9584.913 sec (2h:39m:44s)
Wall-clock time used for tree search: 1398.992 sec (0h:23m:18s)
Total CPU time used: 27739.470 sec (7h:42m:19s)
Total wall-clock time used: 3801.592 sec (1h:3m:21s)
Spun up the next two trees, looks like about 10% RAM each (32GB on my workstation), 50% CPU
Oof - RAM usage jumped to about 30% each.
Crap, I maxed out the RAM and I'm hitting swap; can't log in.
Both IQ-TREE processes completed while the workstation was frozen. Ran the root2tip.R script on the first tree:
art@orolo:~/git/duotang$ Rscript scripts/root2tip.R working/sample1.iqtree.nwk "_" "-1" > working/sample1.rtt.nwk
Loading required package: ape
Loading required package: lubridate
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
art@orolo:~/git/duotang/working$ ls -lt sample1*
-rw-rw-r-- 1 art art 587656 Mar 31 18:52 sample1.rtt.nwk
-rw-r--r-- 1 art art 657696 Mar 31 18:21 sample1.iqtree.nwk
Took about half an hour for one tree; will see if TempEst is faster.
For sample1, there are definitely some outliers:
setwd("~/git/duotang/working")
require(ape)

# load the rooted tree and parse collection dates from the tip labels,
# which end in "_YYYY-MM-DD"
rtt <- read.tree("sample1.rtt.nwk")
tip.dates <- as.Date(sapply(rtt$tip.label, function(x) {
  tokens <- strsplit(x, "_")[[1]]
  tokens[length(tokens)]  # last token is the collection date
}))

# root-to-tip divergence for every tip
div <- node.depth.edgelength(rtt)[1:Ntip(rtt)]

# plot divergence against collection date and fit a linear (strict clock) model
par(mar=c(5,5,1,1))
plot(tip.dates, div, pch=19, cex=0.5, col=rgb(0.5,0,0,0.2), bty='n')
fit <- lm(div~tip.dates)
abline(fit, lwd=2)
#plot(residuals(fit), cex=0.5)

# dashed lines at +/- 3 standard errors of the intercept to flag outliers
out <- summary(fit)
stderr <- out$coefficients[1,2]
abline(a=fit$coef[1]+3*stderr, b=fit$coef[2], lty=2)
abline(a=fit$coef[1]-3*stderr, b=fit$coef[2], lty=2)
Excellent!
So, what's up with that set of outliers all at the same time, around late November 2021?
Good question! Looks like those points are largely from Quebec:
> identify(tip.dates, residuals(fit))
[1] 4669 9001
> rtt$tip.label[4669]
[1] "hCoV-19/Canada/QC-L00413201001/2021_B.1.351_2021-11-24"
> rtt$tip.label[9001]
[1] "hCoV-19/Canada/QC-L00413123001/2021_A.1_2021-11-24"
> identify(tip.dates, residuals(fit))
[1] 4541 9000
> rtt$tip.label[9000]
[1] "hCoV-19/Canada/QC-L00413118001/2021_A.1_2021-11-24"
> rtt$tip.label[4541] # this one is higher
[1] "hCoV-19/Canada/ON-PHL-21-44225/2021_B.1_2021-12-01"
It happened two weeks ago that a sample was mapped to November when it should not have been in November, and it appears they had a database problem in November ... I just wrote to them about that.
For example:
Rscript scripts/filter-rtt.R data_needed/sample2.iqtree.nwk data_needed/sample2.filtered.nwk data_needed/sample2.dates.csv
treetime --tree data_needed/sample2.filtered.nwk --dates data_needed/sample2.dates.tsv --clock-filter 0 --sequence-length 29903
python3 scripts/nex2nwk.py data_needed/sample2.nexus data_needed/sample2.timetree.nwk
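nex2nwk.py presumably just converts TreeTime's Nexus output to plain Newick; a minimal sketch with DendroPy (an assumption, since the real script may handle TreeTime's [&...] annotations differently):

import dendropy

# read the TreeTime time-scaled tree (Nexus) and write it back out as plain Newick
tree = dendropy.Tree.get(path="data_needed/sample2.nexus", schema="nexus")
tree.write(path="data_needed/sample2.timetree.nwk", schema="newick",
           suppress_annotations=True, suppress_rooting=True)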
Closing this issue, will make new issues for remaining tasks
This is a suggestion I raised at the pillar 6 meeting. Now that the number of sequences deposited in the VirusSeq portal is close to the number in GISAID, it should be possible to base the notebook exclusively on the former. As far as I can tell, the rate limiting step is generating PANGO lineage assignments for the VirusSeq data. We have some scripts for doing that here: https://github.com/PoonLab/covizu/tree/vsportal and here https://gist.github.com/ArtPoon/50e26fa99a2e9390d2797dcef00e8b21