Open shajoezhu opened 9 years ago
Excellent question! The protocol has nothing to say about whether duplicate callSetIds should be allowed, so I guess there's nothing particularly wrong with it. I would say it's OK, and the client and server should just do as they are asked.
Anyone else like to weigh in here? Should the protocol have something to say about this?
There's a few ways that we could tackle this:
0) the server executes the request, but throws an error upon being asked to execute request that involves a callSetId that it has already seen 1) the server throws an error about a duplicate callSetId before executing the query 2) the server issues a warning about a duplicate id but proceeds with the request (we would need to clarify what we mean by "issues a warning") 3) the server proceeds with the request, dumbly processing each callSetId (that is, returning duplicate results for the duplicate callSetId) 4) the server proceeds with the request, filtering out duplicate callSetIds (deleting subsequent duplicates in the list after the first one is seen) before executing the query
@jeromekelleher is right that this is a more general issue. It really applies anywhere where the user is asked to submit a list of something in the request. We should probably set a convention generally.
Good summary @dcolligan. I think (1) or (3) are the best options. I can see good arguments for both...
Is there any scenario in which the user would want the same set of results twice? Is there a reason that we should assume the user doesn't know what s/he is doing in that case?
That's what I was trying to figure out. The argument is a list rather than a set, so there's nothing semantically wrong with providing duplicates. It's not a trivial cost for us to check for duplicates, so maybe we should assume that the user knows what they are doing in this case.
It's not a trivial cost?
if len(set(callSetIds)) != len(callSetIds): raise DuplicateCallSetIdException("lame query, bro")
:+1:
Well, we could have a few thousand callSetIds in there, so it might be significant. I'm probably prematurely optimising though, so we should ignore that consideration.
My view is that if we stop people doing stupid (but legitimate) things we might also inadvertently stop them doing clever things. I think duplicate callSetIds is stupid, but probably legitimate.
(3) also has the advantage of not introducing any protocol changes.
Yay, path of least resistance!
Hi @dcolligan, the reason I raised issue was that, when I ran python server_dev.py in one window, and execute the command python client_dev.py -vv variants-search http://localhost:8000/v0.5.1 -m1 -c 1000g_2013.HG00096,1000g_2013.HG00096 -V 1000g_2013 -r 1 -e 100000 2>&1 | less
I get the following:
ga4gh.client: POST http://localhost:8000/v0.5.1/variants/search
ga4gh.client: json request:
ga4gh.client: {
"callSetIds": [
"1000g_2013.HG00096",
"1000g_2013.HG00096"
],
"end": 100000,
"pageSize": 1,
"pageToken": null,
"referenceName": "1",
"start": 0,
"variantName": null,
"variantSetIds": [
"1000g_2013"
]
}
ga4gh.client: json response:
ga4gh.client: {
"nextPageToken": "10234",
"variants": [
{
"alternateBases": [
"AC"
],
"calls": [
{
"callSetId": "1000g_2013.HG00096",
"callSetName": "HG00096",
"genotype": [
1,
0
],
"genotypeLikelihood": [],
"info": {},
"phaseset": "*"
}
],
"created": 1427822858860,
"end": 10177,
"id": "1000g_2013:1:10177",
"info": {
so after discuss with @jeromekelleher , this seems to be a bug that should be fixed at the protocol level. The request was for two callSetIds, even though they are identical, but the calls have only returned once. it hasn't been defined clearly that the callSetId should be a set or a list? is this right?
AFAIK, there are no sets as attributes in the request/response protocol objects. Everything is a list. (Which makes sense, since JSON doesn't support sets.)
The protocol itself is very vague about the semantics of the callSetIds
array though. All it says is
"Only return variant calls which belong to call sets with these IDs."
Should callSetIds
be considered a list or a set? Are we supposed to return the Call
objects in the Variant
in the same order as they are specified in callSetIds
or is the order arbitrary? This is just the sort of detail that can lead to clients writing code to the behaviour of one server, only to find that their code doesn't work when they run it against another. Crap, there goes interoperability.
I'm thinking of something like the following: callSetIds determines the Call objects that are embedded in the returned variant objects. If this is null, calls from all CallSets associated with this VariantSet are returned, and the order in which they occur is arbitrary, but fixed for a given server implementation. If callSetIds is non null, it is interpreted as a list of the IDs of CallSets (which must belong to this VariantSet) for which calls are returned. The order of the Calls in each returned Variant's call field must be the same as the order in which they are given in callSetIds.
For command as such
python client_dev.py -vv variants-search http://localhost:8000/v0.5.1 -m1 -c 1000g_2013.HG00096,1000g_2013.HG00096 -V 1000g_2013 -r 1 -e 100000 2>&1 | less
should ga4gh server raise a warning message to notify client that duplicated callSetId was called?