cidgoh / geem

Genomic Epidemiology Entity Mart
Creative Commons Attribution 4.0 International

Fetch items within package #32

Closed ivansg44 closed 5 years ago

ivansg44 commented 5 years ago

@ddooley

I have investigated the possibility of fetching items from packages through the API, without loading the entire specifications field to JSON.

api/resources/{pk}/specifications/?format=json produces a JSON object containing the entire specifications field of a package with id {pk}.

api/resources/{pk}/specifications/?format=json&id={id} produces a JSON object containing a single term with id {id} from the specifications. This does not load the entire specifications field into memory before filtering it down to one item. Instead, it constructs a QuerySet for that specific item in the specifications JSON object. QuerySet objects do not touch the database until they are actually evaluated. In this pull request, the QuerySet is evaluated at line 108 of geem/views.py, which is after I have specified that I want only one term extracted.
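To illustrate the lazy-evaluation point, here is a minimal toy sketch in plain Python. It is not Django's QuerySet and not the code in geem/views.py; `LazyQuery`, `fake_backend`, and the `TERM:…` ids are hypothetical names made up for this example. It only shows the ordering: the filter is recorded first, and the backend is contacted once, when the query is iterated.

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated query (NOT Django's QuerySet)."""

    def __init__(self, fetch, filters=None):
        self._fetch = fetch                  # callable that performs the real lookup
        self._filters = dict(filters or {})  # filters recorded so far
        self._cache = None                   # filled on first evaluation

    def filter(self, **kwargs):
        # Return a new query with the extra filters; nothing is fetched here.
        return LazyQuery(self._fetch, {**self._filters, **kwargs})

    def __iter__(self):
        # Evaluation happens here, once; later iterations reuse the cache.
        if self._cache is None:
            self._cache = list(self._fetch(self._filters))
        return iter(self._cache)


calls = []  # records every time the hypothetical backend is contacted


def fake_backend(filters):
    """Hypothetical backend; a real one would push the filter down to the database."""
    calls.append(dict(filters))
    specifications = {"TERM:1": {"label": "first"}, "TERM:2": {"label": "second"}}
    term_id = filters.get("id")
    if term_id is None:
        return list(specifications.items())
    if term_id in specifications:
        return [(term_id, specifications[term_id])]
    return []


query = LazyQuery(fake_backend).filter(id="TERM:2")
print(len(calls))     # 0: building and filtering the query contacted nothing
result = list(query)  # evaluation: the backend is contacted with the id filter
print(len(calls))     # 1
```

In this toy, the backend receives `{"id": "TERM:2"}` at evaluation time, which mirrors the idea of narrowing to one term before the query is run.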

There is no "contained" way to add id to the URL in the shape of api/resources/{pk}/specifications/{id}. We would have to hard-code it into api/urls.py, as seen in the Django documentation here. Let me know if having id as a query parameter is acceptable, or if you want it hard-coded into a URL.
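For reference, a hard-coded route of that shape could look roughly like the regular expression below. This is a sketch using only Python's `re` module; `SPEC_ITEM_ROUTE` and `parse_spec_item_path` are hypothetical names, and the assumption that a term id contains no `/` may not hold for the real identifiers. In Django, a similar pattern could be passed to `re_path()` in api/urls.py.

```python
import re

# Hypothetical pattern for api/resources/{pk}/specifications/{id}.
# Assumes pk is an integer and the term id contains no "/" characters.
SPEC_ITEM_ROUTE = re.compile(
    r"^api/resources/(?P<pk>[0-9]+)/specifications/(?P<id>[^/]+)/?$"
)


def parse_spec_item_path(path):
    """Return {'pk': int, 'id': str} if the path matches the route, else None."""
    match = SPEC_ITEM_ROUTE.match(path)
    if match is None:
        return None
    return {"pk": int(match.group("pk")), "id": match.group("id")}


print(parse_spec_item_path("api/resources/5/specifications/TERM:1"))
print(parse_spec_item_path("api/resources/5/specifications/"))
```

The second call returns `None` because no id follows `specifications/`, so that URL would still be handled by the existing whole-specifications endpoint.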

ddooley commented 5 years ago

Looks like great progress. I see the resource "contents" field is a JSONField, specifically a jsonb field, which means fast access via Postgres:

line 73, https://github.com/GenEpiO/geem/blob/master/geem/models.py

contents = JSONField() # Note, this takes a while to save because postgres creates queryable structure of contents?

How do we make sure it is indexed with GIN or GiST, as described at the top of: https://docs.djangoproject.com/en/2.1/ref/contrib/postgres/fields/#django.contrib.postgres.fields.JSONField

"Index and Field.db_index both create a B-tree index, which isn’t particularly helpful when querying complex data types. Indexes such as GinIndex and GistIndex are better suited, though the index choice is dependent on the queries that you’re using. Generally, GiST may be a good choice for the range fields and HStoreField, and GIN may be helpful for ArrayField and JSONField."

As for the id={id} parameter, that's ok, but ideally we would put a path to this in api/urls.py instead. But it depends on all the operations (get / push / delete) that will be piled on soon. What would they look like as standard REST? There will probably be a need in the future to return multiple items, i.e. ids=x,y,z. We can discuss tomorrow.
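As a rough sketch of how a multi-item parameter might be read, the helper below accepts both `id=x` and `ids=x,y,z` from a query string. It uses only the standard library; `requested_term_ids` is a hypothetical name, and the comma-separated `ids` format is just the shape floated above, not something the API currently supports.

```python
from urllib.parse import parse_qs, urlsplit


def requested_term_ids(url):
    """Collect term ids from 'id' and comma-separated 'ids' query parameters."""
    params = parse_qs(urlsplit(url).query)
    ids = []
    for key in ("id", "ids"):
        for raw in params.get(key, []):
            # Split on commas and drop empty pieces such as a trailing ",".
            ids.extend(part.strip() for part in raw.split(",") if part.strip())
    # Remove duplicates while keeping the order in which ids were requested.
    return list(dict.fromkeys(ids))


print(requested_term_ids("api/resources/3/specifications/?format=json&ids=x,y,z"))
print(requested_term_ids("api/resources/3/specifications/?format=json&id=x"))
```

An empty list would mean "no filter", so the view could fall back to returning the whole specifications field in that case.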

ivansg44 commented 5 years ago

@ddooley

See latest commit.

ddooley commented 5 years ago

Nice; in the future, id as a parameter might support multiple identifiers, so it makes sense to have that as a second way.