ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

SearchReads accepts only one ReadGroup ID, contrary to schema #596

Open hjellinek opened 9 years ago

hjellinek commented 9 years ago

The schemas say that SearchReadsRequest accepts at least one ReadGroup ID, but the ref server accepts at most one.

record SearchReadsRequest {
   /**
   The ReadGroups to search. At least one readGroupId must be specified.
   */
   array<string> readGroupIds;
   ...
}

This test case is from a forthcoming compliance test, ReadsSearchIT.searchReadsWithAllIdsReturnsReadsForEach:

threeReadGroupIds = getThreeReadGroupIdsSomehow(); // details omitted (obviously :-))
SearchReadsRequest request =
                SearchReadsRequest.newBuilder()
                                  .setReadGroupIds(threeReadGroupIds)
                                  .setStart(0L)
                                  .setEnd(150L)
                                  .setReferenceId(refId)
                                  .build();
SearchReadsResponse response = client.reads.searchReads(request);

The call to searchReads throws a GAException:

{"errorCode":1151425451,"message":"Exactly one read group id must be specified"}

with HTTP status 501 (NOT_IMPLEMENTED).

This is contrary to the comments in the schema, not to mention that the declaration of the readGroupIds field is array<string>, not a single string.

jeromekelleher commented 9 years ago

This is an intentional limitation of the reference server, as this feature is too complicated for us to implement at the moment. We will try to implement it at some point, but it is difficult to do with our stateless paging architecture.

The reference server's compliance score should be appropriately reduced for this.

david4096 commented 8 years ago

We've just proposed a change to the schemas that would make this feature "implementable". @diekhans

https://github.com/ga4gh/schemas/pull/570

diekhans commented 8 years ago

There was strong objection to removing the cross-BAM merge-sort of reads from the GA4GH API. There are very good use cases for this in multi-sample variant calling and it is supported by htslib.

There should be no requirement to scale beyond a small number of BAMs for the reference server.

It's very important that the primary single-BAM case be supported very soon: obtaining all readgroups within a BAM as part of a single request.

dcolligan commented 8 years ago

@jeromekelleher can you elaborate on why the limitation was put in place and what the roadmap to remove it would look like?

dcolligan commented 8 years ago

From a @diekhans email:

Right now, we don't support the single most important use case, which is reading all the readgroups in a single BAM. The reference server can impose the requirement that BAMs are sorted by location across all readgroups; this is almost always the case. It needs to implement the full API, even if this requires excluding some data.

[We need to] implement multiple readgroups within a single readgroupset. That should be easy. The API is not useful as-is. Cross-BAM reads are a bit more complex, as one needs to implement a merge.

dcolligan commented 8 years ago

It seems like the searchReads endpoint is poorly suited to handle the "give me all reads in a BAM" functionality since it takes a readGroupId (or hopefully, in the future, an array of readGroupIds) argument. Since a ReadGroupSet maps to a BAM, wouldn't an endpoint that took a readGroupSetId be more appropriate? (And as a comment asks in the schema PR, what is a null readGroupSetId array supposed to mean?)

david4096 commented 8 years ago

The schema change I proposed above is similar to what you are suggesting @dcolligan, but it was a feature roll-back that wasn't going to be accepted.

Currently the endpoint takes an array of readGroupIds; it's just that the server under-implements what is in the schema. https://github.com/ga4gh/server/blob/31c41f4ea7b072dba43be79cd4db6111f41eb131/ga4gh/backend.py#L532

The ReadsIntervalIterator could be improved to allow multiple read groups to be specified, instead of being fixed to a parent container. One would then merge the results before returning them as part of _search.
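
A minimal sketch of that merge, assuming Python 3.5+ and per-read-group iterators that already yield reads in position order (the read.position accessor below is illustrative, not the schema's exact field path):

import heapq

def mergeReadGroups(readGroupIterators):
    # heapq.merge lazily interleaves already-sorted streams, holding at
    # most one pending read per group in memory; the key extracts each
    # read's mapped start coordinate. Unmapped reads, which have no
    # position, would be appended separately since the schema returns
    # them last.
    return heapq.merge(*readGroupIterators, key=lambda read: read.position)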

dcolligan commented 8 years ago

I agree that the searchReads endpoint, implemented correctly to take N readGroupIds, could implement the "give me all reads in a BAM" functionality. However, that would require the client to know all of the readGroupIds contained in the BAM. That seems like an inconvenient way to address a BAM, which should ideally be done via a readGroupSetId. It also mandates an extra step for a client: getting a complete list of all of the readGroupIds that are in a readGroupSet/BAM.

From that schema thread it seems like we shouldn't abandon the current readsSearch implementation, just make it conform to the schema (i.e. accept multiple readGroupIds). Which, of course, is what this issue is all about.

However, that still doesn't cleanly address, in my view, the motivating "give me all reads in a BAM" use case. If we feel that use case is important -- and it seems that @diekhans does -- shouldn't we advocate for a new endpoint in the API that implements this functionality?

diekhans commented 8 years ago

Yes, it's inconvenient for the clients. However, let's revisit that independently. We need to move the server forward.

dcolligan commented 8 years ago

If we're willing to impose additional burdens on the client to solve this issue, note that our client already "solves" the problem by making one searchReads request per readGroupId.

diekhans commented 8 years ago

We are not allowed to pass this off to the client; that is very inefficient. These choices were made for very good reasons, though those reasons were never documented. BAMs are sorted by readgroupset, not readgroup.

dcolligan commented 8 years ago

Question: if len(readGroupIds) > 1, in what order should the reads be returned? Should they be returned sorted by coordinate, or by readGroupId (and then by coordinate within readGroupId)? The schema doesn't specify this.

dcolligan commented 8 years ago

And if the latter, do we need to return the reads in the order in which the readGroupIds were specified in the array, or can we return them in any order?

david4096 commented 8 years ago

I think you want to sort by position.

  The list of matching alignment records, sorted by position.
  Unmapped reads, which have no position, are returned last.

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/readmethods.avdl#L61

diekhans commented 8 years ago

They should be returned sorted by coordinate across all read groups. The schema should specify this; I think it's there, but obscure.

However, the server should not sort them. The reference server's requirement for serving a BAM can be that all readgroups in a readgroupset are stored in a single BAM in coordinate-sorted order.

This will almost always be the case; it becomes the data provider's problem to create a conforming BAM, not the reference server's problem to sort it.

jeromekelleher commented 8 years ago

I think this is going to be harder than you think @diekhans, but I'll look into it and give an analysis over the next day.

diekhans commented 8 years ago

@jeromekelleher It's always harder than I think and easier than you think! Either way, I owe you a beer.

@dcolligan is also looking at it.

dcolligan commented 8 years ago

@jeromekelleher and I have been talking about this, but here are some of the things that have come up (I, of course, still want to hear his opinion in this thread).

Firstly, no other API endpoint we have has an array of ids as a mandatory argument (we have some that have arrays as optional arguments). This introduces some problems, mostly of specification. For instance: what happens if some of the ids are invalid, but others are valid? What if there are duplicate ids? Is there a limit on how many ids can be sent, or should we just keep accepting ids until the server runs out of memory? Etc.

Secondly, although the intended use case may be searching for reads within the same BAM, there is nothing stopping the client from sending readGroupIds that are contained in different BAMs. In that case, how should we order the reads that come back? If we do need to return them by position, that would imply sorting (that is, merging) them across different BAMs.

Thirdly, given our stateless architecture, we need to keep track of the readGroupIds that were originally sent, since we have no way of storing them on the server. This may be a large list, which means it will need to be transmitted on every request and response in the pageToken (along with any indices that serve as bookkeeping for whatever algorithm we implement). That is a lot of overhead.

All of this seems to point in the direction of instead implementing a readGroupSetSearch endpoint that takes one readGroupSetId as a mandatory argument. In this way, we avoid all of the problems above since we would be at most searching through one BAM.
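
For concreteness, a minimal sketch of what carrying that state in a pageToken could look like (the JSON-in-base64 encoding is an assumption for illustration, not the server's actual token format):

import base64
import json

def encodePageToken(readGroupIds, groupIndex, nextPosition):
    # Everything needed to resume a stateless scan rides in the token:
    # the full list of requested ids, which id is currently being
    # serviced, and where to resume within it. With many ids the token
    # gets large, which is exactly the overhead described above.
    state = {"ids": readGroupIds, "idx": groupIndex, "pos": nextPosition}
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()

def decodePageToken(token):
    state = json.loads(base64.urlsafe_b64decode(token.encode()).decode())
    return state["ids"], state["idx"], state["pos"]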

diekhans commented 8 years ago

pysam shows the code to implement this:

iter = samfile.fetch("seq1", 10, 20)
for x in iter:
    print(str(x))

Once one figures out whether the list of readgroups is all the groups in the readgroupset, where do the problems lie?
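
A sketch of that check, assuming the ids correspond to the BAM's @RG header records (older pysam exposes the header as a dict; newer versions may need samfile.header.to_dict() first):

import pysam

def coversWholeReadGroupSet(bamPath, requestedReadGroupIds):
    # A ReadGroupSet maps to one BAM, whose @RG header records name
    # every read group it contains; if the request lists exactly those
    # ids, one coordinate-sorted pass over the file serves them all.
    samfile = pysam.AlignmentFile(bamPath, "rb")
    allIds = {rg["ID"] for rg in samfile.header["RG"]}
    return set(requestedReadGroupIds) == allIds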

dcolligan commented 8 years ago

That only searches one BAM, not N BAMs. That would be similar to the code that we would have in a readGroupSetSearch endpoint.

jeromekelleher commented 8 years ago

Well, it looks like nobody actually implements multiple BAMs (Google don't, apparently) so we may as well follow the crowd here. As an initial pass, how about we implement this as if it were passed a single ReadGroupSetId? That is, there are two legal inputs for the readGroupIds in the searchReads endpoint:

  1. Exactly one readGroupId, which we have implemented now; or
  2. k readgroup ids which correspond to exactly one ReadGroupSet. That is, all the readGroupIds belong to one ReadGroupSet, and all ReadGroups from that set are present. Therefore, we don't have to worry about filtering, and we don't have to worry about merging among an arbitrary number of BAMs. We might relax either of these assumptions in the future, or we might not.

I think the best thing would be to have two independent code paths to deal with this, so that we end up with something like:

if searchByReadGroup:
    intervalIterator = ReadsIntervalIterator(request, readGroup, reference)
else:
    intervalIterator = ReadsIntervalIterator(request, readGroupSet, reference)

In the future, hopefully sanity will prevail and we will split these two things (search through ReadGroup and search through ReadGroupSet) into different endpoints. For now, hopefully this workaround will cover the majority of the use cases that people actually have.

@dcolligan, what do you think, is this implementable? @diekhans, is this sufficient?
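
A hedged sketch of the validation this two-case rule implies (the two lookup helpers are hypothetical, standing in for the server's data repository):

def classifyRequest(readGroupIds, getParentReadGroupSetId, getReadGroupIdsOfSet):
    # Case 1: exactly one readGroupId -- the path implemented today.
    if len(readGroupIds) == 1:
        return "readGroup"
    # Case 2: all ids must share one parent ReadGroupSet and must cover
    # every ReadGroup in it; anything else is rejected for now.
    parents = {getParentReadGroupSetId(rgId) for rgId in readGroupIds}
    if len(parents) != 1:
        raise ValueError("readGroupIds span multiple ReadGroupSets")
    if set(readGroupIds) != set(getReadGroupIdsOfSet(parents.pop())):
        raise ValueError("readGroupIds do not cover the whole ReadGroupSet")
    return "readGroupSet"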

diekhans commented 8 years ago

Internally, I would implement this as two paths, as would be done with two endpoints. This would optimize the common case. We can propose an API change in parallel.

The requirements for this part of the API were well considered, just never written down, and most of the people involved are no longer contributing.

diekhans commented 8 years ago

Yes, that is the first and most important use case.

The more complex case is doing the merge, although it's claimed that htslib supports this, in which case it could be exposed through pysam.

diekhans commented 8 years ago

CH says Google does implement it, but we can do this in phases. Personally, I would wait until streaming for multi-BAM.

dcolligan commented 8 years ago

OK, it looks like we can compromise on @jeromekelleher's plan.

jeromekelleher commented 8 years ago

Not according to @calbach @diekhans:

https://github.com/ga4gh/schemas/pull/570#issuecomment-194481531

The Google API's SearchReads accepts one or more read group set IDs and returns a paginated stream of reads which are ordered by position.

Note: We do not support searching by arbitrary readGroupIds from different read group sets (only from the same read group set) as we remain skeptical that this is a useful query. We may add support in the future - this would not be a significant amount of work, but it's not been requested.

So, Google allows search on a subset of the ReadGroups from a ReadGroupSet but does not support merge across ReadGroupSets.

I remain entirely unconvinced that merging across ReadGroups can't be done more effectively and simply on the client side --- users will have to do it anyway if they want data from different servers. But this is beside the point: @dcolligan, let's go with the plan above if you're happy with it.

calbach commented 8 years ago

Sorry if my reply was misleading. We absolutely support merge across multiple ReadGroupSets, when specified in the readGroupSetIds field (which is not part of the GA4GH spec).

jeromekelleher commented 8 years ago

Interesting... can you supply a link to the docs for this endpoint please @calbach? Thanks for the clarification!

macieksmuga commented 8 years ago

My understanding from the GA4GH DWG call was that, as Mark says, merging results (sorted by position) across multiple readGroups is absolutely an important requirement. There was genuine push-back during that call on any attempt to adjust the API in the direction discussed above, so I don't think that's a real possibility going forward.

Which leaves us with ensuring the reference server does its job, namely, fulfills the requirements of the API. Doing this for multiple readGroups within a single readGroupSet should be relatively trivial for the reference server implementation (as Mark noted above), and the general use case should certainly be doable, provided the claimed htslib support for this functionality exists in pysam (and even without such library-level support, merge-sorting multiple result sets on the fly is doable in the near term).

Also, I'm perhaps misunderstanding @dcolligan's earlier-voiced concerns with the ReadGroupIDs in the request. As I understand it, passing in multiple ReadGroupIDs per page request is exactly in keeping with the spirit of the API as it currently stands, and is directly analogous to passing in CallSetIDs in the variants API. I believe this overhead would be trivially small relative to the volume of data returned, and would presumably be resolved altogether in a future streaming version of the API. Is this correct, or am I missing something important?

calbach commented 8 years ago

@jeromekelleher sure: https://cloud.google.com/genomics/reference/rest/v1/reads/search

dcolligan commented 8 years ago

In a recent conference call, some of the team discussed what would be required to implement a fully-conforming readsSearch endpoint. In my mind, this would involve:

  • a new IntervalIterator subclass: we would need to figure out a way to handle potentially cross-BAM requests; this will likely require a refactor of IntervalIterator
  • pageTokens: since multiple readGroupIds are requested rather than a single one, we need to send the array of requested ids back and forth in a pageToken, along with the index of the id we are currently servicing

This is mostly rehashing what I already said in this thread here: https://github.com/ga4gh/server/issues/596#issuecomment-199860489

Of course, the necessity of doing this might not be there (or not be a priority), but that can be discussed in the schemas repo. Perhaps an alternative to adding a multi-bam endpoint would be a readsSearch endpoint which only accepts a single readGroupId and a readGroupSetSearch endpoint which only accepts a single readGroupSetId. TBD.

diekhans commented 8 years ago

The most common case would be all readgroups in all BAMs, so this could be identified upfront and special-cased to avoid storing all the readgroupids in the token.

Fetching only a subset of the readgroups will most likely be significantly slower.

kozbo commented 8 years ago

Hmm. I think this issue is about implementing to the current state of the Schemas. While the work so far goes a long way to implementing the schemas, and is enough for the short term, this issue is not complete until they have been implemented. So I think the right thing to do is to leave this issue open until that work is complete too.

dcolligan commented 8 years ago

@kozbo sometimes GitHub automatically closes issues when they are referenced in a PR that is merged... but it always seems to happen to the ones you don't want closed... I agree that this issue should remain open

dcolligan commented 8 years ago

Limited case addressed in #1021, but general case still not resolved

kozbo commented 8 years ago

Taking this issue out of the milestone because the work required for the 1000 Genome support is complete.