cstubben / ENAbrowseR

Search the ENA Browser REST URL

Suggestion: two stage data retrieval #2

Open mdehollander opened 7 years ago

mdehollander commented 7 years ago

I have been trying to retrieve metadata (like location) for a complex query, but even on the ENA website that does not work in one step.

The goal is to retrieve all pure metagenomic datasets, i.e. to exclude amplicon-based methods. That's possible with this query: m1 <- ena_search("library_source=METAGENOMIC AND library_selection!=PCR AND library_strategy=WGS AND tax_tree(408169) AND first_public>2016-01-01", showURL=TRUE, result="read_run")

This only works with read_run as the result; for sample, this information does not seem to be available. I think that, using the sample_accession column of the m1 query result, it should be possible to retrieve the metadata via a URL of the form https://www.ebi.ac.uk/ena/data/view/<SAMPLE1>,<SAMPLE2>&display=xml, from which you can get the latitude and longitude.
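A minimal sketch of how the second-stage URLs could be built from the first-stage result, assuming only the URL pattern quoted above. The helper name build_view_urls, the chunk size of 100, and the sample accessions are hypothetical placeholders, and the data frame stands in for the real m1 object so the snippet runs without network access:

```r
# Hypothetical helper: build "data/view" URLs for a vector of sample
# accessions, split into chunks so that no single URL becomes too long.
build_view_urls <- function(accessions, chunk_size = 100) {
  # Drop missing values and duplicates (many runs can share one sample)
  accessions <- unique(accessions[!is.na(accessions)])
  if (length(accessions) == 0) return(character(0))
  # Assign each accession to a chunk of at most chunk_size items
  chunks <- split(accessions, ceiling(seq_along(accessions) / chunk_size))
  base <- "https://www.ebi.ac.uk/ena/data/view/"
  vapply(
    chunks,
    function(x) paste0(base, paste(x, collapse = ","), "&display=xml"),
    character(1),
    USE.NAMES = FALSE
  )
}

# Stand-in for the read_run result of the ena_search() call above;
# the accessions are made-up placeholders, not real ENA records.
m1 <- data.frame(
  sample_accession = c("SAMEA0000001", "SAMEA0000002", "SAMEA0000001"),
  stringsAsFactors = FALSE
)

urls <- build_view_urls(m1$sample_accession)
print(urls)
```

Each resulting URL could then be fetched and parsed for the latitude and longitude fields; that step is left out here because the XML layout is not described in this thread.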

Would it be possible to integrate such a two-stage data retrieval step into the Shiny application? Then it would be possible to have stronger interactive filtering and include the results on the nice map.

cstubben commented 7 years ago

I think there are a few options to get sample metadata for all sample_accessions returned by a read_run query. At least in the example R Shiny app, there may be more than 10K runs each week, so I'd still run a query with result=sample using the same first_public date and join that to the read_run table. That usually matches about 90% of the sample accessions in the runs. For the remaining runs that used old samples, I will check on formatting an OR-separated list of IDs for the REST service to fill in the missing sample metadata.
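As a rough sketch of that join, assuming both results are data frames that share a sample_accession column: the tables below are made-up placeholders standing in for the two ena_search() results, and the sample_accession="..." query syntax for the OR-separated list is an assumption, since the exact format accepted by the REST service still needs to be checked.

```r
# Placeholder tables; in practice these would come from two ena_search()
# calls (result="read_run" and result="sample") with the same first_public date.
runs <- data.frame(
  run_accession    = c("ERR0000001", "ERR0000002", "ERR0000003"),
  sample_accession = c("SAMEA0000001", "SAMEA0000002", "SAMEA0000009"),
  stringsAsFactors = FALSE
)
samples <- data.frame(
  sample_accession = c("SAMEA0000001", "SAMEA0000002"),
  location         = c("placeholder location A", "placeholder location B"),
  stringsAsFactors = FALSE
)

# Left join: keep every run, with NA metadata where no sample matched
joined <- merge(runs, samples, by = "sample_accession", all.x = TRUE)

# Samples used by runs but not returned by the date-limited sample query
# (the "old" samples)
missing <- setdiff(runs$sample_accession, samples$sample_accession)

# OR-separated list of the missing IDs; the field="value" form is an
# assumption and may need adjusting for the REST service
or_query <- paste0('sample_accession="', missing, '"', collapse = " OR ")

print(joined)
print(or_query)
```

The merge keeps unmatched runs rather than dropping them, so the rows with NA metadata can be filled in later once the old samples have been retrieved.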

Another option is to download all sample metadata and join your read_run results to that table. I often use ena_download to get samples by year, and in the upcoming release I will update my scripts to use the new readr package for faster loading (and tibble formats). Then you just need to select from that table to find old samples.

mdehollander commented 7 years ago

Thanks! I will have a look at that. What do you mean by 'old samples'?

I just came across another REST API at EBI, the EB-eye or ebisearch: http://www.ebi.ac.uk/Tools/webservices/services/eb-eye_rest With that, I seem to be able to get only metagenomic samples with a single query. Have you looked at this service as well?

cstubben commented 7 years ago

No, I have not used EB-eye, but will check. By old samples, I just mean that when people submit a FASTQ file for a run, they usually submit the sample metadata at the same time (and you can find both using a first_public date query). In some cases, the run may use a sample that was already submitted, so they just link to the "old" sample metadata.