Use cases feedback - retrieving data from multiple sources

mcourtot commented 7 years ago

I'm not sure whether this is already meant to be covered under federated queries, but the specific use case I have in mind is retrieving raw genomic data from EGA based on a search in BioSD, such as "retrieve genomic data for samples of male patients under 2y old" (or any other metadata query).

kozbo commented 7 years ago

I am not sure if this fits in the API. Are you talking about retrieving the read data?

mcourtot commented 7 years ago

Yes. Specifically, the user requests "read data for samples of male patients under 2y old"; we would first run a query to retrieve the IDs of the matching samples, then pass those IDs to the Streaming API to retrieve the raw data. It may be that this is specific to the way Biosamples at EBI holds the sample/metadata information and needs to pass results to the Streaming API, which then connects to EGA. But I was thinking that, given the Streaming API implementations existing or planned (e.g. Google Nexus), there would be a need to leverage metadata somewhere to query the right raw data. Hope that clarifies; happy to hear other similar cases.
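A minimal sketch of that two-step flow, assuming in-memory stand-ins for both services. `SAMPLES`, `READS`, `find_sample_ids` and `stream_reads` are hypothetical names for illustration only; they are not part of BioSamples, EGA, or any Streaming API.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical sample metadata, standing in for a metadata service.
SAMPLES: List[Dict] = [
    {"id": "S1", "sex": "male", "age_years": 1.5},
    {"id": "S2", "sex": "female", "age_years": 1.0},
    {"id": "S3", "sex": "male", "age_years": 7.0},
    {"id": "S4", "sex": "male", "age_years": 0.5},
]

# Hypothetical read store keyed by sample ID, standing in for a raw-data service.
READS: Dict[str, List[str]] = {
    "S1": ["ACGT", "TTGA"],
    "S3": ["GGCC"],
    "S4": ["CATG"],
}


def find_sample_ids(predicate: Callable[[Dict], bool]) -> List[str]:
    """Step 1: run the metadata query and return only the matching sample IDs."""
    return [sample["id"] for sample in SAMPLES if predicate(sample)]


def stream_reads(sample_ids: List[str]) -> Iterator[Tuple[str, str]]:
    """Step 2: hand the IDs to the raw-data service and yield (sample ID, read) pairs."""
    for sample_id in sample_ids:
        for read in READS.get(sample_id, []):
            yield sample_id, read


# "Read data for samples of male patients under 2y old"
ids = find_sample_ids(lambda s: s["sex"] == "male" and s["age_years"] < 2)
result = list(stream_reads(ids))
```

The point of the split is that the raw-data side never needs to understand the metadata query; it only receives a list of IDs.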

kozbo commented 7 years ago

Ah, yes, thanks for explaining. I like this use case. +1

lairdm commented 7 years ago

This is open to interpretation, but when I see "federated queries" I envision going out to other servers to find other data sets stored by partner organizations. What you're describing, where for example everything comes from EBI and it's just different divisions holding different pieces of the requested data, I'd see as something that should be handled by the provider and hidden from the user, e.g. the discussed central EBI portal, where the user makes a query and we poke all the internal services needed to fulfil the request. But that's not federated, as it isn't across providers.

Regardless, I've made a note of your use case in the document.

mcourtot commented 7 years ago

As there is no central EBI portal, it seemed analogous to me to having multiple providers (because let's be honest, each internal service has its own quirks ;)), but as mentioned I wasn't sure if that was what was already intended. In any case, I'm happy with whichever vocabulary choice; thanks for clarifying, Matt.

kozbo commented 7 years ago

I interpreted this use case as one satisfied by a single server. But Matt raises a point that has been in some discussion here at UCSC. We have been thinking about how we could allow data to be local or on a remote server. This is especially useful for data best managed by the authority on the subject. References are the best example but ontologies and sequence ontologies also fit this model.

We haven't gotten far enough into thinking about this to suggest changes to the schemas. The one thought we did have is that it would make sense to split each service out from the others more cleanly.

Perhaps this is a use case we should capture in this document.

lairdm commented 7 years ago

There's definitely a lot more discussion needed, but take the example of a dataset that has metadata about its samples/individuals plus read data, where a user asks for a particular slice of it. On the EBI side I would imagine the different pieces of the returned set will be stored on different technologies (sample data in a database, sequence/reads from ENA or similar). Hiding the fetching and assembling of those pieces from the user is important, as is hiding that the collection of 20, 30, 50 endpoints that might eventually be available is spread across multiple servers behind the scenes.

I don't consider that federated. For a ga4gh server that offers the set (or subset) of endpoints we'll eventually define, how the maintainers of that service decide to implement responding to that collection of endpoints is internal to that provider. Whether it's a monolithic server or a frontend server that sends different endpoints to different internal services is up to the individual implementer. When I said federated, I was referring to queries that cross from one server offering a particular endpoint to one or more other servers offering that same endpoint, with the results collated.
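To make the distinction concrete, here is a rough sketch under stated assumptions: every function, route, and provider name below is hypothetical and stands in for a real service. `frontend` shows one provider routing different endpoints to different internal services; `federated` shows the same endpoint being queried on several providers and the results collated.

```python
from typing import Callable, Dict, List

Service = Callable[[str], List[str]]


# Hypothetical internal services of ONE provider, each owning one endpoint.
def sample_service(query: str) -> List[str]:
    return [f"sample-metadata:{query}"]


def reads_service(query: str) -> List[str]:
    return [f"reads:{query}"]


ROUTES: Dict[str, Service] = {
    "/samples": sample_service,
    "/reads": reads_service,
}


def frontend(endpoint: str, query: str) -> List[str]:
    """Not federated: one provider routes each endpoint to an internal service."""
    return ROUTES[endpoint](query)


# Hypothetical independent providers that all expose the SAME endpoint.
def provider_a(query: str) -> List[str]:
    return [f"A:{query}"]


def provider_b(query: str) -> List[str]:
    return [f"B:{query}"]


def federated(providers: List[Service], query: str) -> List[str]:
    """Federated: send one query to every provider and collate the results."""
    collated: List[str] = []
    for provider in providers:
        collated.extend(provider(query))
    return collated
```

In the first case the caller sees a single server; in the second, something has to know about several servers and merge their answers.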

Now, this could all just be semantics and we could all be envisioning the same sets of operations just using different terms. And sorry as the new kid if I'm coming off too strongly after all the work that's been done over the years up to now :)

kozbo commented 7 years ago

Ah, thanks for the clarification @lairdm. I was unaware of the mechanisms behind the EBI service. Our 1kgenomes.ga4gh.org server is an example of this sort of thing. It stores the VCF information locally but does a remote URL read from AWS to retrieve read data, as the BAMs were too large to store locally. Let's move the larger per-service federation discussion out to another thread.
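As a loose illustration of that local-plus-remote arrangement (not the actual server code), a sketch might look like the following. `HybridServer`, its arguments, and the example URL are all hypothetical, and the remote fetch is injected as a plain function so the sketch does not depend on any particular storage client.

```python
from typing import Callable, Dict, List


class HybridServer:
    """Serves variants from local storage and reads from a remote URL."""

    def __init__(
        self,
        local_variants: Dict[str, List[str]],
        read_urls: Dict[str, str],
        fetch: Callable[[str], bytes],
    ) -> None:
        self._local_variants = local_variants  # small enough to keep locally
        self._read_urls = read_urls            # large files live elsewhere
        self._fetch = fetch                    # hypothetical remote reader

    def variants(self, sample_id: str) -> List[str]:
        # Answered entirely from local data.
        return self._local_variants.get(sample_id, [])

    def reads(self, sample_id: str) -> bytes:
        # Answered by reading the remote file; the caller never sees the URL.
        return self._fetch(self._read_urls[sample_id])


# Stand-in for a remote read so the example runs without network access.
def fake_fetch(url: str) -> bytes:
    return f"contents of {url}".encode()


server = HybridServer(
    local_variants={"S1": ["chr1:100 A>G"]},
    read_urls={"S1": "https://example.org/bams/S1.bam"},  # hypothetical URL
    fetch=fake_fetch,
)
```

Both methods look the same to a client, which is the sense in which the storage split stays internal to the provider.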

Ensembl / schemas

Use cases feedback - retrieving data from multiple sources #6