Programmatically interact with the GISAID EpiCoV, EpiPox, and EpiRSV databases.
[!TIP] Please consider moving your research focus to an open pathogen on pathoplexus.org.
If you use GISAIDR in your research please cite as:
Wytamma Wirth, & Sebastian Duchene. (2022). GISAIDR: Programmatically interact with the GISAID databases. Zenodo. https://doi.org/10.5281/zenodo.6474693
Install from github using devtools
.
install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("Wytamma/GISAIDR")
Get username
and password
from GISAID.
username = Sys.getenv("GISAIDR_USERNAME")
password = Sys.getenv("GISAIDR_PASSWORD")
Login and save your credentials (they are used for all future database queries)
credentials <- login(username = username, password = password)
The EpiCoV database is selected by default, however, GISAIDR also works with the EpiRSV and EpiPox databases (limited testing).
credentials <- login(username = username, password = password, database="EpiRSV")
# or
credentials <- login(username = username, password = password, database="EpiPox")
Note: You need a GISAID account with access to EpiRSV and EpiPox.
Query the database with query()
using your credentials
df <- query(credentials = credentials)
head(df[0:6])
# | id | virus_name | passage_details_history | accession_id | collection_date | submission_date |
---|---|---|---|---|---|---|
1 | EPI_ISL_1789201 | hCoV-19/USA/IL-S21WGS954/2021 | Original | EPI_ISL_1789201 | 2021-04-16 | 2021-04-29 |
2 | EPI_ISL_1789200 | hCoV-19/USA/IL-S21WGS885/2021 | Original | EPI_ISL_1789200 | 2021-04-02 | 2021-04-29 |
3 | EPI_ISL_1789199 | hCoV-19/USA/IL-S21WGS884/2021 | Original | EPI_ISL_1789199 | 2021-04-12 | 2021-04-29 |
4 | EPI_ISL_1789198 | hCoV-19/USA/IL-S21WGS883/2021 | Original | EPI_ISL_1789198 | 2021-04-14 | 2021-04-29 |
5 | EPI_ISL_1789197 | hCoV-19/USA/IL-S21WGS882/2021 | Original | EPI_ISL_1789197 | 2021-04-15 | 2021-04-29 |
6 | EPI_ISL_1789196 | hCoV-19/USA/IL-S21WGS881/2021 | Original | EPI_ISL_1789196 | 2021-04-13 | 2021-04-29 |
Use nrows
and start_index
to page through results. GISAID limits the number of results returned with each request to 50. Internally GISAIDR runs a loop to batch queries with > 50 rows requested. See fast option below.
df <- query(credentials = credentials, nrows = 1000, start_index = 100)
nrow(df)
[1] 1000
Use fast
to load all of the accesion_ids that match the query. These accesion_ids can then be used in the download
function to download up to 5000 sequences at a time.
df <- query(
credentials = credentials,
location = "Oceania",
from_subm = "2022-07-26",
to_subm = "2022-07-28",
fast = TRUE
)
head(df$accession_id)
Selecting all 484 accession_ids.
Returning 0-484 of 484 accession_ids.
[1] "EPI_ISL_14061265" "EPI_ISL_14061266" "EPI_ISL_14061267" "EPI_ISL_14061268" "EPI_ISL_14061269" "EPI_ISL_14061270"
Use order_by
to order the results or query
by a column. Use order_asc
to change the direction of order_by
(defaults to TRUE).
df <- query(credentials = credentials, order_by = 'submission_date')
df$submission_date
[1] "2020-01-10" "2020-01-10" "2020-01-11" "2020-01-11" "2020-01-11" "2020-01-12" "2020-01-14"
[8] "2020-01-14" "2020-01-14" "2020-01-14" "2020-01-16" "2020-01-17" "2020-01-17" ...
Use text
for full text search.
accession_ids = c("EPI_ISL_17398411", "EPI_ISL_17199001", "EPI_ISL_17409201", "EPI_ISL_17243716")
df <- query(credentials = credentials, text = paste(accession_ids, collapse = "\n"))
> df$accession_id
[1] "EPI_ISL_17199001" "EPI_ISL_17243716" "EPI_ISL_17398411" "EPI_ISL_17409201"
Use location
to search for entries based on geographic location.
df <- query(credentials = credentials, location = 'Australia')
df$location
[1] "Oceania / Australia / Western Australia" "Oceania / Australia / Queensland"
[3] "Oceania / Australia / Queensland" "Oceania / Australia / Queensland"
[5] "Oceania / Australia / Western Australia" ...
A list of GISAID locations (not complete) can be found in GISAID_LOCATIONS.txt. The location search is hierarchical e.g. querying for 'Africa / ...' will return all the regions within Africa, while querying for 'Africa / Angola / ...' will only return the regions in Angola. Region can be further subdivided by specifying more levels e.g. 'North America / USA / Alabama / Butler County'. The search uses pattern matching and does not have to follow the hierarchical format above.
Use lineage
to search for entries based on pango lineage designations.
df <- query(credentials = credentials, lineage = 'B.1.1.7')
full_df <- download(credentials = credentials, list_of_accession_ids = df$accession_id) # see below for download() info.
full_df$pangolin_lineage
[1] "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7"
[11] "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7"
[21] ...
Variants can be queried by name e.g. 'omicron', 'gh/490r', 'delta', 'alpha', 'beta', 'gamma', 'lambda', or 'mu'. Unfortunately GISAID doesn't return the variant designation from the query or download so variants must be confirmed with pangolin_lineage or GISAID_clade.
# VOC Omicron GRA (B.1.1.529+BA.*) first detected in Botswana/Hong Kong/South Africa
omicron_df <- query(credentials = credentials, variant = 'omicron')
omicron_full_df <- download(credentials = credentials, list_of_accession_ids = omicron_df$accession_id)
omicron_full_df$pangolin_lineage
[1] "BA.2" "BA.2" "BA.2.10.1" "BA.2" "BA.2" "BA.5" "BA.2" "BA.2" "BA.2.3"
[10] "BA.4" "BA.2" "BA.2" "BA.2" "BA.2" "BA.2" "BA.2" "BA.2" "BA.1.17"
[19] ...
Use from
and to
to search for entries from specific dates.
df <- query(credentials = credentials, from = '2021-04-05', to = '2021-04-06')
df$collection_date
[1] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[8] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[15] ...
Use from_subm
and to_subm
to search for entries from specific dates.
df <- query(credentials = credentials, from_subm = '2021-04-05', to_subm = '2021-04-05')
df$submission_date
[1] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[8] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[15] ...
Use virus_name
to search for entries using the virus name.
df <- query(credentials = credentials, virus_name="hCoV-19/Ireland/D-BHTEST/2022")
df$virus_name
[1] "hCoV-19/Ireland/D-BHTEST/2022"
Search for multiple virus names using a list.
virus_names <- list("hCoV-19/Ireland/KY-Enfer-230922004_A4/2022", "hCoV-19/Ireland/CO-Enfer-240922010_E9/2022")
df <- query(credentials = credentials, virus_name=virus_names)
df$virus_name
[1] "hCoV-19/Ireland/CO-Enfer-240922010_E9/2022" "hCoV-19/Ireland/KY-Enfer-230922004_A4/2022"
You can also match parts of the virus name e.g.
df <- query(credentials = credentials, virus_name="hCoV-19/Ireland")
df$virus_name
[1] "hCoV-19/Ireland/KY-Enfer-260922007_C6/2022" "hCoV-19/Ireland/KY-Enfer-260922007_C4/2022"
[3] "hCoV-19/Ireland/KY-Enfer-260922007_C2/2022" "hCoV-19/Ireland/KY-Enfer-260922007_C10/2022"
[5] "hCoV-19/Ireland/KY-Enfer-260922007_C1/2022" "hCoV-19/Ireland/CO-Enfer-260922007_B7/2022"...
Use aa_substitution
and nucl_mutation
to search for entries using amino acid Substitutions and nucleotide mutations.
aa_substitution_df <- query(credentials = credentials, aa_substitution = 'Spike_E484Q, Spike_H69del, -N_P13L')
nucl_mutation_df <- query(credentials = credentials, nucl_mutation = '-T23599G, -C10029T')
Use low_coverage_excl
to exclude low coverage entries from the results.
df <- query(credentials = credentials, low_coverage_excl = TRUE)
grep("Long stretches of NNNs", df$information)
integer(0)
GISAID considers genomes >29,000 nt as complete. Use complete
to include only complete entries in the results.
df <- query(credentials = credentials, complete = TRUE)
all(df$length > 29000)
[1] TRUE
GISAID considers genomes with \<1% Ns and \<0.05% unique amino acid mutations as high coverage . Use high_coverage
to include only high coverage entries in the results.
df <- query(credentials = credentials, high_coverage = TRUE)
length(grep("warn_sign", df$information)) == 0
[1] TRUE
Use collection_date_complete
to include only entries with complete collection date.
df <- query(credentials = credentials, collection_date_complete = TRUE)
Use load_all
to get all the entries that match your query without having to specify nrows
.
df <- query(credentials = credentials, lineage = 'W.1', load_all = TRUE)
nrow(df)
[1] 100
Note: you may end up downloading the entire GISAID database if your query is too general.
Use total
to get the number of entries that match you query.
total <- query(credentials = credentials, total = TRUE)
total
[1] 10145747
To download the full data set you need a list of accession IDs (which can be obtained from query
results). This will also download the sequence data for each entry.
full_df_with_seq <- download(
credentials = credentials,
list_of_accession_ids = list_of_accession_ids,
)
full_df_with_seq$sequence
[1] "AGATCTGTTCTCTAAACGAACTTTAAAATCT...
[2] "AGATCTGTTCTCTAAACGAACTTTAAAATCT...
[3] "AGATCTGTTCTCTAAACGAACTTTAAAATCT...
...
You can stop GISAIDR from loading the sequence data into the memory by setting get_sequence=FALSE. Note: the sequence data will still be downloaded.
df <- query(credentials = credentials)
list_of_accession_ids <- df$accession_id
full_df <- download(credentials = credentials, list_of_accession_ids = list_of_accession_ids, get_sequence=FALSE)
colnames(full_df)
[1] "strain" "virus" "gisaid_epi_isl" "genbank_accession"
[5] "date" "region" "country" "division"
[9] "location" "region_exposure" "country_exposure" "division_exposure"
[13] "segment" "length" "host" "age"
[17] "sex" "Nextstrain_clade" "pangolin_lineage" "GISAID_clade"
[21] "originating_lab" "submitting_lab" "authors" "url"
[25] "title" "paper_url" "date_submitted" "purpose_of_sequencing"
Note: a maximum of 5000 results can be downloaded at a time.
Use the export_fasta function to write sequence data to a file in fasta format. The sequence names will be country\@pango_lineage\@accesion_id\@date, with the date in decimal format (requires the lubridate package). The default is to only export sequences for which a decimal date could be set. To prevent this, use the argument export_dated_only = F.
export_fasta(full_df_with_seq, out_file_name = 'GISAID_sequences.fasta')
Date format (default: decimal year) and delimiter (default: \@) can be set with the date_format and delimiter arguments respectively.
export_fasta(full_df_with_seq, out_file_name = 'GISAID_sequences.fasta', date_format='%Y-%m-%d', delimiter='|')
Use the columns
argument to choose which columns are included in the export.
export_fasta(full_df_with_seq, out_file_name = 'GISAID_sequences.fasta', columns = c("accession_id", "country", "pangolin_lineage", "date"))
GISAIDR relies on the custom selection interface of gisaid.org. If GISAIDR is giving you errors, first check that it is not gisaid.org producing these errors. We can't do anything to fix errors with gisaid.org.
If you have an epiflu account (i.e. you were using GISAID before COVID-19) you may have issues logging in as GISAID may default you to the epiflu database.
When updating GISAIDR run detach("package:GISAIDR", unload=TRUE)
first to ensure the update is applied.
Download all of the 2020 entries from Asia but outside China.
library(GISAIDR)
# country list from GISAID
Asia <- c('Asia / Afghanistan',
'Asia / Armenia',
'Asia / Bahrain',
'Asia / Bangladesh',
'Asia / Brunei',
'Asia / Cambodia',
# 'Asia / China', # remove China
'Asia / Georgia',
'Asia / Hong Kong',
'Asia / India',
'Asia / Indonesia / ...',
'Asia / Iran',
'Asia / Iraq',
'Asia / Israel',
'Asia / Japan',
'Asia / Jordan / ...',
'Asia / Kazakhstan / ...',
'Asia / Kuwait',
'Asia / Lebanon',
'Asia / Malaysia',
'Asia / Mongolia',
'Asia / Myanmar',
'Asia / Nepal',
'Asia / Oman / ...',
'Asia / Pakistan',
'Asia / Palestine / ...',
'Asia / Philippines',
'Asia / Qatar / ...',
'Asia / Saudi Arabia',
'Asia / Singapore',
'Asia / South Korea',
'Asia / Sri Lanka',
'Asia / Taiwan',
'Asia / Thailand',
'Asia / Timor-Leste',
'Asia / United Arab Emirates',
'Asia / Uzbekistan',
'Asia / Vietnam')
credentials <- login(username = username, password = password)
asia_not_china_df <- data.frame()
for (country in Asia) {
print(country)
df <- query(credentials = credentials, location = country, load_all = TRUE, from = "2020-01-01", to = "2020-12-31")
asia_not_china_df <- rbind(asia_not_china_df, df)
}
head(asia_not_china_df)
complete
checkbox) and inspect the source code.ceid
for this element (<div id="ce_qxos9a_bi">
) e.g. ce_qxos9a_bi
<input class="sys-event-hook" name="ce_qxos9a_bi_name" style="vertical-align: middle;" type="checkbox" value="complete">
) e.g. complete
.<script>
header tags you'll find a createFI() function e.g. this.getForm().createFI('ce_qxos9a_bi','CheckboxWidget','quality'
. The ceid
is dynamic so you need to use the widget name i.e. quality
to find the ceid
dynamically.login.R
file add code to extract the ceid
using the extract_search_ceid()
function and the widget e.g.# Complete
quality_ceid <- extract_search_ceid("quality'", customSearch_page_text)
ceid
to the list of credentials
e.g. complete_ceid = complete_ceid
query()
function in query.R
and internal_query.R
e.g. complete = FALSE
.(load_all && j$totalRecords > nrows)
load_all recursion loop so all paginations will continue using the argument.complete
argument is used. Create the command using the create_search_queue()
function. Use the complete_ceid
for the ceid
and the checkbox value (identified in step 4) for the cvalue
e.g.if (complete) {
queue <-
append(
queue,
create_search_queue(credentials,
ceid=credentials$complete_ceid,
cvalue=list('complete'),
cmd='FilterChange')
)
}
query()
call e.g. complete = complete
test-query.R
e.g.test_that("complete works", {
df <- query(credentials = credentials, complete = TRUE)
expect_true(all(df$length > 29000))
})
See this PR for the full details.