ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
96 stars 91 forks source link

Should duplicated callSetId raise a warning message? #308

Open shajoezhu opened 9 years ago

shajoezhu commented 9 years ago

For command as such python client_dev.py -vv variants-search http://localhost:8000/v0.5.1 -m1 -c 1000g_2013.HG00096,1000g_2013.HG00096 -V 1000g_2013 -r 1 -e 100000 2>&1 | less should ga4gh server raise a warning message to notify client that duplicated callSetId was called?

jeromekelleher commented 9 years ago

Excellent question! The protocol has nothing to say about whether duplicate callSetIds should be allowed, so I guess there's nothing particularly wrong with it. I would say it's OK, and the client and server should just do as they are asked.

Anyone else like to weigh in here? Should the protocol have something to say about this?

dcolligan commented 9 years ago

There's a few ways that we could tackle this:

0) the server executes the request, but throws an error upon being asked to execute request that involves a callSetId that it has already seen 1) the server throws an error about a duplicate callSetId before executing the query 2) the server issues a warning about a duplicate id but proceeds with the request (we would need to clarify what we mean by "issues a warning") 3) the server proceeds with the request, dumbly processing each callSetId (that is, returning duplicate results for the duplicate callSetId) 4) the server proceeds with the request, filtering out duplicate callSetIds (deleting subsequent duplicates in the list after the first one is seen) before executing the query

@jeromekelleher is right that this is a more general issue. It really applies anywhere where the user is asked to submit a list of something in the request. We should probably set a convention generally.

jeromekelleher commented 9 years ago

Good summary @dcolligan. I think (1) or (3) are the best options. I can see good arguments for both...

dcolligan commented 9 years ago

Is there any scenario in which the user would want the same set of results twice? Is there a reason that we should assume the user doesn't know what s/he is doing in that case?

jeromekelleher commented 9 years ago

That's what I was trying to figure out. The argument is a list rather than a set, so there's nothing semantically wrong with providing duplicates. It's not a trivial cost for us to check for duplicates, so maybe we should assume that the user knows what they are doing in this case.

dcolligan commented 9 years ago

It's not a trivial cost?

if len(set(callSetIds)) != len(callSetIds): raise DuplicateCallSetIdException("lame query, bro")
shajoezhu commented 9 years ago

:+1:

jeromekelleher commented 9 years ago

Well, we could have a few thousand callSetIds in there, so it might be significant. I'm probably prematurely optimising though, so we should ignore that consideration.

My view is that if we stop people doing stupid (but legitimate) things we might also inadvertently stop them doing clever things. I think duplicate callSetIds is stupid, but probably legitimate.

dcolligan commented 9 years ago

(3) also has the advantage of not introducing any protocol changes.

Yay, path of least resistance!

shajoezhu commented 9 years ago

Hi @dcolligan, the reason I raised issue was that, when I ran python server_dev.py in one window, and execute the command python client_dev.py -vv variants-search http://localhost:8000/v0.5.1 -m1 -c 1000g_2013.HG00096,1000g_2013.HG00096 -V 1000g_2013 -r 1 -e 100000 2>&1 | less I get the following:

ga4gh.client: POST http://localhost:8000/v0.5.1/variants/search
ga4gh.client: json request:
ga4gh.client: {
    "callSetIds": [
        "1000g_2013.HG00096", 
        "1000g_2013.HG00096"
    ], 
    "end": 100000, 
    "pageSize": 1, 
    "pageToken": null, 
    "referenceName": "1", 
    "start": 0, 
    "variantName": null, 
    "variantSetIds": [
        "1000g_2013"
    ]
}
ga4gh.client: json response:
ga4gh.client: {
    "nextPageToken": "10234", 
    "variants": [
        {
            "alternateBases": [
                "AC"
            ], 
            "calls": [
                {
                    "callSetId": "1000g_2013.HG00096", 
                    "callSetName": "HG00096", 
                    "genotype": [
                        1, 
                        0
                    ], 
                    "genotypeLikelihood": [], 
                    "info": {}, 
                    "phaseset": "*"
                }
            ], 
            "created": 1427822858860, 
            "end": 10177, 
            "id": "1000g_2013:1:10177", 
            "info": {

so after discuss with @jeromekelleher , this seems to be a bug that should be fixed at the protocol level. The request was for two callSetIds, even though they are identical, but the calls have only returned once. it hasn't been defined clearly that the callSetId should be a set or a list? is this right?

dcolligan commented 9 years ago

AFAIK, there are no sets as attributes in the request/response protocol objects. Everything is a list. (Which makes sense, since JSON doesn't support sets.)

jeromekelleher commented 9 years ago

The protocol itself is very vague about the semantics of the callSetIds array though. All it says is

"Only return variant calls which belong to call sets with these IDs."

Should callSetIds be considered a list or a set? Are we supposed to return the Call objects in the Variant in the same order as they are specified in callSetIds or is the order arbitrary? This is just the sort of detail that can lead to clients writing code to the behaviour of one server, only to find that their code doesn't work when they run it against another. Crap, there goes interoperability.

I'm thinking of something like the following: callSetIds determines the Call objects that are embedded in the returned variant objects. If this is null, calls from all CallSets associated with this VariantSet are returned, and the order in which they occur is arbitrary, but fixed for a given server implementation. If callSetIds is non null, it is interpreted as a list of the IDs of CallSets (which must belong to this VariantSet) for which calls are returned. The order of the Calls in each returned Variant's call field must be the same as the order in which they are given in callSetIds.