EarthLifeConsortium / pilot_api

Early pilot version of an API
MIT License

Clarifying flat vs. structured data responses #7

Open SimonGoring opened 8 years ago

SimonGoring commented 8 years ago

Hi @mmcclenn & @jpjenk, I just want to clarify the discussion we had about flat data structures in the API response.

Right now, regardless of data format (json, xml, csv), we are returning data as a flat table.

I understand the motivation for doing this for csv formats, but the JSON and XML formats are designed to return structured data, so I'm not clear why we wouldn't use this in that case.

For example, the bibJSON schema for publications is designed to support variable-length author lists, or sets of publications with differing reference structures.

Given the extent of repetition and the potentially large size of some of our responses it might make sense to consider structured data formats for some of the responses, particularly since we're making our users define the response type they're expecting.

For example, a publication response in JSON would use the bibJSON standard, while in CSV it would be a wide table that could be saved as a CSV file.

My thinking is two-fold:

  1. I want to avoid repetition in the response as much as possible. Even structuring the API response for occurrences:
{
"elapsed_time":14.8,
"warnings":[
"Neotoma: Request failed",
"Neotoma:  WKT not properly formatted: Polygon((-180 -90,10 -90,10 180,-180 180,-180 -90))"
],
"records": [
{"Database":"PaleoBioDB","OccurrenceID":"pbdb:occ:94749","RecordType":"Occurrence","TaxonName":"Busycon","TaxonID":"pbdb:txn:10874","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"},
. . . 
{"Database":"PaleoBioDB","OccurrenceID":"pbdb:occ:94752","RecordType":"Occurrence","TaxonName":"Busycotypus canaliculatus","TaxonID":"pbdb:txn:94432","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"}]}

versus:

{
"elapsed_time":14.8,
"records": [
{"Database":"PaleoBioDB","occurrences":[{"OccurrenceID":"pbdb:occ:94749","RecordType":"Occurrence","TaxonName":"Busycon","TaxonID":"pbdb:txn:10874","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"},
. . . 
{"OccurrenceID":"pbdb:occ:94752","RecordType":"Occurrence","TaxonName":"Busycotypus canaliculatus","TaxonID":"pbdb:txn:94432","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"}]}]}

saves us an astounding 24 bytes per row :) Which isn't that much, I suppose, but then we could add a bit more structure, returning a taxon table for multi-taxon responses that would link the taxon IDs to the names, so we wouldn't need to repeat those either. I think we'd see performance improvements in the downstream applications that use the API, particularly web-based services that use JSON natively.
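To make the taxon-table idea concrete, here is a rough Python sketch of how a flat record list could be regrouped by database with taxon names moved into a lookup table. The `taxa` key and the `structure_records` function are hypothetical names used only for illustration; the field names and values come from the example records above, trimmed to a few fields.

```python
from collections import defaultdict

# Two flat records, trimmed from the example response above.
flat_records = [
    {"Database": "PaleoBioDB", "OccurrenceID": "pbdb:occ:94749",
     "TaxonName": "Busycon", "TaxonID": "pbdb:txn:10874",
     "SiteID": "pbdb:col:7108"},
    {"Database": "PaleoBioDB", "OccurrenceID": "pbdb:occ:94752",
     "TaxonName": "Busycotypus canaliculatus", "TaxonID": "pbdb:txn:94432",
     "SiteID": "pbdb:col:7108"},
]


def structure_records(records):
    """Group occurrences by database and move taxon names to a lookup table.

    The output layout ("taxa" plus nested "occurrences") is an assumption,
    not an agreed response format.
    """
    taxa = {}                      # TaxonID -> TaxonName, stored once
    by_database = defaultdict(list)
    for rec in records:
        taxa[rec["TaxonID"]] = rec["TaxonName"]
        # Drop the fields that are now stored elsewhere in the response.
        occurrence = {k: v for k, v in rec.items()
                      if k not in ("Database", "TaxonName")}
        by_database[rec["Database"]].append(occurrence)
    return {
        "taxa": taxa,
        "records": [{"Database": db, "occurrences": occs}
                    for db, occs in by_database.items()],
    }
```

A client would then resolve a name with `response["taxa"][occurrence["TaxonID"]]`, so the saving grows with the number of occurrences that share a taxon.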

Tagging @spatialit as well.

SimonGoring commented 8 years ago

I thought of better examples. . . The example above seems trivial :)

mmcclenn commented 8 years ago

Simon, I have a couple of reasons for returning everything as a flat table.

1) To decouple the response from the response format. You can get any response you like as either CSV or JSON, and you get exactly the same data from each.
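As an illustration of that decoupling, the sketch below serializes one list of flat records to either JSON or CSV using only the Python standard library. It is not the server's actual code; `to_json` and `to_csv` are hypothetical helper names, and the records are trimmed from the example earlier in the thread.

```python
import csv
import io
import json

# One flat record list; both serializers consume exactly this structure.
records = [
    {"OccurrenceID": "pbdb:occ:94749", "TaxonName": "Busycon",
     "AgeOlder": 2.588, "AgeYounger": 0.0117, "AgeUnit": "Ma"},
    {"OccurrenceID": "pbdb:occ:94752",
     "TaxonName": "Busycotypus canaliculatus",
     "AgeOlder": 2.588, "AgeYounger": 0.0117, "AgeUnit": "Ma"},
]


def to_json(rows):
    """Wrap the record list in a JSON object with a "records" key."""
    return json.dumps({"records": rows})


def to_csv(rows):
    """Write the same records as a CSV table, one row per record."""
    fieldnames = list(rows[0].keys()) if rows else []
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because every record is a flat mapping with the same keys, the format choice only affects the final serialization step.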

2) To simplify the server code. It makes things SO MUCH EASIER if the server just has to generate a list of records. Trying to format things into a complicated JSON structure makes the code more complicated and slows everything down.

That said, there is no reason why we couldn't use bibJSON and format the records in a more natural way. That actually wouldn't complicate things much because each record is still a separate JSON string. In fact, that is a very good idea. We could add bibJSON as a vocabulary option when returning publications.
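A rough sketch of what that vocabulary option might look like, assuming a flat publication record with a semicolon-separated author field. The input field names (`PublicationID`, `Title`, `Authors`, `Year`, `Journal`) and all values are made-up placeholders, and the output keys follow my reading of the BibJSON convention, so they should be checked against the spec before being relied on.

```python
import json

# Hypothetical flat publication record; every value is a placeholder.
flat_publication = {
    "PublicationID": "example:pub:1",
    "Title": "An example title",
    "Authors": "Author One; Author Two; Author Three",
    "Year": "2016",
    "Journal": "An Example Journal",
}


def to_bibjson(flat):
    """Map one flat record to a BibJSON-style record (assumed key names)."""
    authors = [{"name": name.strip()}
               for name in flat["Authors"].split(";") if name.strip()]
    return {
        "type": "article",
        "title": flat["Title"],
        "author": authors,           # variable-length list, one object each
        "year": flat["Year"],
        "journal": {"name": flat["Journal"]},
        "identifier": [{"type": "local", "id": flat["PublicationID"]}],
    }


# Each publication is still serialized as its own JSON string.
record_strings = [json.dumps(to_bibjson(pub)) for pub in [flat_publication]]
```

The server still emits a plain list of records; only the per-record mapping changes.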

I am a lot more skeptical about, for example, listing sites and having a sub-list of occurrences under them. That is a good example of something that would complicate the server code. I would much rather implement this as two separate calls: one to list the sites, and one to list the occurrences, with the latter including a siteID field so that you can match up which occurrence goes to which site.
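For clients that do want the nested view, the two flat responses can be matched up on their side. The sketch below assumes the results of a hypothetical sites call and a hypothetical occurrences call are already parsed into lists of dicts; `SiteName` and its value are placeholders, while the IDs and taxon names come from the example earlier in the thread.

```python
from collections import defaultdict

# Parsed result of a hypothetical "list sites" call.
sites = [
    {"SiteID": "pbdb:col:7108", "SiteName": "placeholder site name"},
]

# Parsed result of a hypothetical "list occurrences" call.
occurrences = [
    {"OccurrenceID": "pbdb:occ:94749", "TaxonName": "Busycon",
     "SiteID": "pbdb:col:7108"},
    {"OccurrenceID": "pbdb:occ:94752",
     "TaxonName": "Busycotypus canaliculatus",
     "SiteID": "pbdb:col:7108"},
]


def attach_occurrences(site_rows, occurrence_rows):
    """Return a copy of each site with its occurrences nested under it."""
    by_site = defaultdict(list)
    for occ in occurrence_rows:
        by_site[occ["SiteID"]].append(occ)
    # Sites without occurrences get an empty list rather than a missing key.
    return [{**site, "occurrences": by_site.get(site["SiteID"], [])}
            for site in site_rows]


nested = attach_occurrences(sites, occurrences)
```

This keeps the server responses flat and leaves the nesting to whichever clients need it.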