biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info
Other
113 stars 20 forks source link

Return consistent types #42

Closed buchanae closed 5 years ago

buchanae commented 6 years ago

Looks like the types of some fields can change between string, list, or non-existent. For example, "ensembl" and "alias".

This gets tedious to work with on the client side. I'll need to add checks for these fields now, which will inevitably end up as a helper library I use to wrap all calls. It'd be a lot easier if the API adhered to a strict schema.

newgene commented 6 years ago

I will point you to this previous issue: https://github.com/biothings/mygene.info/issues/7

buchanae commented 6 years ago

I see. Well, I think you should reconsider the amount of work you're putting on to your clients, especially those of us with statically typed languages. It's also very easy to miss, and therefore leads to subtly broken code (and therefore subtly broken analysis results). Within one week of using this service, I've already encountered this with a coworker.

But, I respect your decision, and still appreciate the great work you guys are doing!

newgene commented 6 years ago

@buchanae I understand that the current data structure will cause more troubles on a statically typed language (Java) than a dynamically typed one (Python). We will think of something we can do at API-level (not the underlying data structure level) to help address this issue.

One thing in my mind is to add a query parameter like alwayslist=field1,field2 to enforce any field to be a list, basically, bring that alwayslist helper function into the API. Similarly, we can also add an allownull=field1,field2 to make a non-existing field as a null value instead.

Let us know how you think. We are happy to hear the feedback from the statically-typed language side.

buchanae commented 6 years ago

That's a clever way to apply a fix without breaking the API, nice!

Although, personally I'd just want a single flag (or separate endpoint) that coerces the whole data structure into consistent types.

newgene commented 6 years ago

Yes, that will be ideal. And this requires us to maintain a complete data schema (it's on our roadmap anyway), more of a long-term solution. We will have some internal discussions on this and figure out how we should prioritize this.

buchanae commented 6 years ago

@newgene Awesome, thanks for taking this into consideration.

stuppie commented 5 years ago

This was a very annoying thing for me to deal with as well. I implemented a solution using cerberus to validate the json doc and perform coercion on the fields I was interested in using before accessing the data. Example:

from cerberus import Validator

def alwayslist(value):
    """If input value if not a list/tuple type, return it as a single value list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return value
    else:
        return [value]

mygene_schema = {
    'entrezgene': {'type': 'integer', 'coerce': int, 'required': True},
    'taxid': {'type': 'integer', 'required': True, 'coerce': int},
    'SGD': {'type': 'string', 'required': False},
    'HGNC': {'type': 'string', 'required': False},
    'MIM': {'type': 'string', 'required': False},
    'MGI': {'type': 'string', 'required': False},
    'locus_tag': {'type': 'string', 'required': False},
    'RGD': {'type': 'string', 'required': False},
    'FLYBASE': {'type': 'string', 'required': False},
    'WormBase': {'type': 'string', 'required': False},
    'ZFIN': {'type': 'string', 'required': False},
    'ensembl': {
        'type': 'list', 'coerce': alwayslist, 'required': False,
        'schema': {
            'type': 'dict',
            'allow_unknown': True,
            'required': False,
            'schema': {'gene': {'type': 'string'},
                       'protein': {'type': 'list', 'coerce': alwayslist, 'schema': {'type': 'string'}},
                       'transcript': {'type': 'list', 'coerce': alwayslist, 'schema': {'type': 'string'}},
                       },
        }
    }
}

v = Validator(mygene_schema, allow_unknown=True)
v.validate(d)
if v.errors:
    raise ValueError(v.errors)
return v.document
cyrus0824 commented 5 years ago

Hi @buchanae , the feature that @newgene described above is implemented on mygene.info.

the always_list parameter will make a field always be a list (if it exists), the allow_null parameter will insert a field into the document as null if it doesn't exist.

Here's an example:

http://mygene.info/v3/query?q=cdk?&always_list=_id&fields=_id&allow_null=test.test_field

It's still very much a work-in-progress, so I'll leave the issue open until you get a chance to test it and leave feedback.

buchanae commented 5 years ago

This is the python code I used to try this feature:

requests.get("http://mygene.info/v3/query", params={
        "q": "symbol:" + gene,
        "fields": "ensembl.gene,symbol,genomic_pos",
        "species": "human",
        "always_list": "ensembl, genomic_pos",
    })

Seems to work. Thank you for doing this.

I would prefer to have one parameter that set always_list and allow_null for all fields, but I can also build a client-side helper to do that. Thoughts?

newgene commented 5 years ago

@buchanae Great! Glad this works for you. And we are indeed planning on a more "formal" solution for this. I created a new issue: #52 to track the progress and will close this issue.