Wytamma / GISAIDR

Programmatically interact with the GISAID database.

Difference in metadata columns #27

Open Wytamma opened 2 years ago

Wytamma commented 2 years ago

When I download a metadata file there is no AA Substitutions column... I'm not sure why we have different columns. It seems that different users may be getting different results from GISAID? See #26.

tomwenseleers commented 2 years ago

Yes I think that's because you use the official interface that GISAID provides; the AA Substitutions column only seems present when one selects records directly on the GISAID website & then presses Download at the bottom, where one can then choose between downloading the metadata or the FASTA sequences...

With manually selected & downloaded GISAID records I get a tsv back with these columns

 [1] "Virus name"                      "Accession ID"                   
 [3] "Collection date"                 "Location"                       
 [5] "Host"                            "Additional location information"
 [7] "Sampling strategy"               "Gender"                         
 [9] "Patient age"                     "Patient status"                 
[11] "Last vaccinated"                 "Passage"                        
[13] "Specimen"                        "Additional host information"    
[15] "Lineage"                         "Clade"                          
[17] "AA Substitutions"

It's only when I used the GISAIDR download function that I get the columns

 [1] "id"                      "virus_name"             
 [3] "passage_details_history" "accession_id"           
 [5] "collection_date"         "submission_date"        
 [7] "information"             "length"                 
 [9] "host"                    "location"               
[11] "originating_lab"         "submitting_lab"

which misses the AA Substitutions field (confirmed by directly inspecting the gisaidr_data_tmp.tar file)...
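To make the difference between the two column sets explicit, here is a minimal sketch in base R that compares them. The column names are copied from the two listings above; the `normalise` helper is hypothetical (not part of GISAIDR) and only does naive name matching, so related fields with different names, such as "Passage" and `passage_details_history`, are not paired up.

```r
# Column names reported in this thread for the two download routes.
website_cols <- c(
  "Virus name", "Accession ID", "Collection date", "Location", "Host",
  "Additional location information", "Sampling strategy", "Gender",
  "Patient age", "Patient status", "Last vaccinated", "Passage", "Specimen",
  "Additional host information", "Lineage", "Clade", "AA Substitutions"
)
gisaidr_cols <- c(
  "id", "virus_name", "passage_details_history", "accession_id",
  "collection_date", "submission_date", "information", "length", "host",
  "location", "originating_lab", "submitting_lab"
)

# Hypothetical helper: lower-case the names and replace every run of
# non-alphanumeric characters with "_" so both sets use one naming style.
normalise <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

website_norm <- normalise(website_cols)

shared       <- intersect(website_norm, gisaidr_cols)  # present in both
only_website <- setdiff(website_norm, gisaidr_cols)    # website download only
only_gisaidr <- setdiff(gisaidr_cols, website_norm)    # GISAIDR download only

print(only_website)
```

With these inputs, `aa_substitutions` ends up in `only_website`, which matches what is described above.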

I think getting a tsv back with all the columns included could be supported if the download would be driven via RSelenium, similar to how I download the GISAID batch download packages that are available, https://stackoverflow.com/questions/72632118/download-covid-patient-metadata-from-gisaid-website-in-r-using-rselenium.

This would involve:

1. Enter username & password at https://www.epicov.org/epi3/frontend and press the Login button
2. Press the Search tab
3. Press the Select tab at the bottom
4. Paste the GISAID accession numbers (no more than 10 000 at a time), or point to a csv file with the desired accession numbers
5. Press the OK button
6. Press the Download button at the bottom
7. Select Patient status metadata or Nucleotide sequences (FASTA)
8. Press Download
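Because at most 10 000 accession numbers can be pasted at a time, a longer list would have to be split into batches before driving the selection step. A minimal base R sketch of that batching follows; `chunk_ids` is a hypothetical helper (not a GISAIDR or RSelenium function), and the accession IDs are made up for illustration.

```r
# Hypothetical helper: split accession IDs into batches of at most `max_n`,
# dropping duplicates first so no record is requested twice.
chunk_ids <- function(ids, max_n = 10000) {
  ids <- unique(ids)
  if (length(ids) == 0) {
    return(list())
  }
  # Batch index per ID: 1 for the first max_n IDs, 2 for the next max_n, ...
  unname(split(ids, ceiling(seq_along(ids) / max_n)))
}

# Made-up accession IDs, for illustration only.
ids <- sprintf("EPI_ISL_%d", 1:25000)
batches <- chunk_ids(ids)

# Each batch could then be pasted (or written to a csv file) in turn.
print(lengths(batches))
```

Each element of `batches` is a character vector that fits within the limit mentioned in the steps above.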

Aside from downloading particular records in this way (which should also get the AA substitutions field), I think supporting the download of the batch download packages via RSelenium could be cool too, but you would probably just have to put it in a separate function, as one can then only download the whole database (download+reading it in in R then just takes 2 mins), and not a particular subset.

Wytamma commented 2 years ago

Hi @tomwenseleers, I'm not using the official GISAID interface (none exists as far as I can tell). GISAIDR just sends the equivalent HTTP requests that you send when using the website. I think the problem here is that we have different versions of GISAID? This is what my download panel looks like: [image: download panel]. There is no Patient status metadata or Nucleotide sequences (FASTA) option, only Augur or acknowledgements. When I press download I get a zip that combines the metadata and the sequences. Can you please double check the URLs for the steps above? My URL is https://www.epicov.org/epi3/frontend, i.e. with /frontend. If I use https://www.epicov.org/epi3 without /frontend I get a 404 error.

tomwenseleers commented 2 years ago

Ha sorry. What a shame then - it seems GISAID somehow decided to give different users different tiers of access or what? How is one supposed to write reproducible code to drive this?

The URL I get to start with is https://www.epicov.org/epi3/start which then gets me to https://www.epicov.org/epi3/frontend#174123 but the #XXXXXX nr at the end is different each time I login.

If I use GISAIDR I also get back a .tar file with sequences & metadata combined, and with metadata lacking that AA substitutions field. This is what confused me, because if I manually log in to the GISAID website, select some records and press Download at the bottom, I get this dialog ([image: GISAID record download]) and I can download the metadata & sequences separately.

Aside from that I also have batch package download options available when I press the Downloads button at the top of the page, which for me looks like this: [image: GISAID package download1] [image: GISAID package download2]. I know that the Genomic epidemiology tab is missing for most, but I thought everyone would at least have access to the tab Download packages with metadata? Or is that not the case? And would the fields & columns you get back also differ per user?

tomwenseleers commented 2 years ago

For the record, with my login & credentials, this is how I managed to download a separate metadata file with all the columns I was given access to. The code given can also be modified a bit to allow download of the FASTA; it uses RSelenium (so a bit different from your httr approach). It also shows how to get the most recently uploaded records that are absent from the download package: https://stackoverflow.com/questions/72632118/download-covid-patient-metadata-from-gisaid-website-in-r-using-rselenium