chop-dbhi / avocado

Metadata APIs for Django
http://avocado.harvest.io

Database- vs. Programmatically- Defined Customizations #78

Closed bruth closed 10 years ago

bruth commented 11 years ago

Avocado has two core data types, the Field and the Concept. Fields map to, and provide an interface for, Django model fields, which in turn map directly to database columns. Additional metadata such as a verbose name, plural name, units, description, etc. can be associated with a Field instance. The query behavior of the instance can also be customized via the translator attribute.

A Concept associates one or more Fields together and is the public view of what data is available for query. A formatter can be assigned to a concept to customize how the data is formatted on the way out of the database. In the simplest case, a single field concept would just let the data pass through as is without modifying it. In more complex cases, a second query to the database could be issued to fetch some complicated related data, a local cache hit could be performed, or even an external service could be interacted with to fill in the necessary data for the concept.

Both translators and formatters are Python classes. There is a base class for each type that provides the default behavior. Custom subclasses can be defined and registered with the respective class registry. For example:

from avocado import formatters

# subclass
class MyFormatter(formatters.Formatter):
    pass

# register the class with a name
formatters.registry.register(MyFormatter, 'My Formatter')

Avocado currently supports defining the translator and formatter at the database level (e.g. in the Django admin). The items registered for each type (as shown above) populate the available choices that can be selected. For example, all translators registered will appear in the translator drop-down in the admin and likewise for the formatters.
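To illustrate how registered classes could end up as drop-down choices, here is a minimal sketch. The `Registry` class and its `choices` property are stand-ins written for this example, not Avocado's actual implementation; only the `register(klass, name)` call shape is taken from the snippet above.

```python
class Registry:
    """Minimal stand-in for a class registry (not Avocado's implementation)."""

    def __init__(self):
        self._registry = {}

    def register(self, klass, name):
        if name in self._registry:
            raise ValueError('{!r} is already registered'.format(name))
        self._registry[name] = klass

    def get(self, name):
        return self._registry[name]

    @property
    def choices(self):
        # (value, label) pairs, the shape Django uses for field choices
        return sorted((name, name) for name in self._registry)


class Formatter:
    pass


class MyFormatter(Formatter):
    pass


registry = Registry()
registry.register(Formatter, 'Default')
registry.register(MyFormatter, 'My Formatter')
```

Every registered name shows up in `registry.choices` regardless of which concept is being edited, which is the property the rest of this issue is concerned with.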

The assumption for this design was that it would be easy for a non-technical admin to manage the metadata in the admin interface (most people like a UI) and it would be easy to change these attributes on-the-fly if needed.

In theory this sounds like a clever way to bridge the gap between code and data. In practice, however, there is a disconnect between what data is being acted on (e.g. the fields in a concept) and the implementation of the class (e.g. the formatter).

Given this formatter class:

class MyFormatter(formatters.Formatter):
    def to_html(self, values, **context):
        return u'{} → {}'.format(values['foo'], values['bar'])
    to_html.process_multiple = True

The formatter is making two assumptions: that the concept has two fields, and that those fields are keyed as foo and bar in values.

This makes the formatter tied to concepts of a very specific type. It is arguably not appropriate to have this formatter in the drop-down list in the admin since a user could select the formatter for a concept not compatible with the formatter. This leads me to my first question and concern.

If users managing the data in the admin are not assumed to know the underlying implementations of the formatters and translators, how can the above stated mishap be prevented?

Without sufficient non-technical documentation (which may itself lead to more confusion), I am not sure this issue can be prevented. In most cases formatters and translators are closely tied to the one or few instances they apply to. Making them available as a drop-down list in an admin interface makes it seem they can be arbitrarily changed, which would certainly lead to issues.
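To make the mishap concrete, here is a small sketch of what happens when the formatter is selected for a concept with different fields. `MyFormatter` below is a plain-Python stand-in that mirrors the class above without the Avocado base class, and `try_format` is a hypothetical helper added only for this example.

```python
class MyFormatter:
    # Stand-in mirroring the formatter above, without the Avocado base class
    def to_html(self, values, **context):
        return u'{} → {}'.format(values['foo'], values['bar'])


def try_format(formatter, values):
    """Return the formatted string, or an error message if a key is missing."""
    try:
        return formatter.to_html(values)
    except KeyError as e:
        return 'incompatible concept: missing field {}'.format(e)


# A concept with the fields the formatter expects
ok = try_format(MyFormatter(), {'foo': 1, 'bar': 2})

# A concept the formatter was never written for
bad = try_format(MyFormatter(), {'baz': 1})
```

Nothing in the admin drop-down stops the second pairing from being saved; the error only surfaces when the data is formatted.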

murphyke commented 11 years ago

Jeremy can weigh in, but I don't think this has been a problem on the PCGC project. In practice, we haven't had many concepts that require custom formatting. Since 1) we haven't encountered cases where we have foreseen that two or more formatters might reasonably apply to a given concept, and 2) there isn't a way to dynamically upload formatters, there would be no disadvantage for us in wiring formatters to concepts in the code, since we have to modify the code and restart the web server anyway whenever we make a formatter change.
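Wiring formatters to concepts in code could look roughly like the sketch below. All names here (`CONCEPT_FORMATTERS`, `formatter_for`, the `'foo_bar'` concept key) are hypothetical and only illustrate the idea; this is not how Avocado or PCGC does it.

```python
class Formatter:
    # Stand-in default: join whatever values the concept provides
    def to_html(self, values):
        return ', '.join(str(v) for v in values.values())


class FooBarFormatter(Formatter):
    def to_html(self, values):
        return '{} -> {}'.format(values['foo'], values['bar'])


# Hypothetical code-level wiring: concept name -> formatter class
CONCEPT_FORMATTERS = {'foo_bar': FooBarFormatter}


def formatter_for(concept_name):
    """Return the formatter wired to a concept, falling back to the default."""
    return CONCEPT_FORMATTERS.get(concept_name, Formatter)()
```

With this kind of mapping, the pairing lives next to the formatter code and changes with it in the same deploy.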

An alternative would be typing the formatters and translators to one degree or another (number or type of fields, or class of formatter) and using that type information in building the drop-downs. Of course, if you wanted to put maximum power in the hands of the user, you'd have to support uploadable formatter code ;-)
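One way the typing idea might be sketched: each formatter declares the field names it needs, and the drop-down is built only from formatters the concept can satisfy. The `required_fields` attribute and `compatible_choices` function are assumptions made for this example, not existing Avocado features.

```python
class Formatter:
    # Hypothetical attribute: field names the formatter needs.
    # An empty set means the formatter works with any concept.
    required_fields = frozenset()


class FooBarFormatter(Formatter):
    required_fields = frozenset(['foo', 'bar'])


REGISTRY = {'Default': Formatter, 'Foo/Bar': FooBarFormatter}


def compatible_choices(concept_fields, registry=REGISTRY):
    """Return names of formatters whose required fields are all in the concept."""
    fields = set(concept_fields)
    return sorted(
        name for name, klass in registry.items()
        if klass.required_fields <= fields
    )
```

For a concept with fields `foo`, `bar` and `baz` both formatters would be offered, while a concept with only `baz` would be offered just the default.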

bruth commented 11 years ago

Thanks for the insight @murphyke. Out of curiosity, what power would you expect from uploadable formatter code? What restrictions would be in place to prevent someone (theoretically) hacking in and uploading arbitrary code? (I've gone against my own rule of not getting theoretical.)

bruth commented 11 years ago

@murphyke In thinking of an alternative approach, I am sensitive to the fact that PCGC has way more fields and concepts than any other project so far. It seems like an extreme case, but considering there are only about 4 production Harvest apps, it's too small a sample size to tell. Roughly how many fields and concepts are defined for PCGC? How many translators are used? How many custom formatters?

murphyke commented 11 years ago

> what power would you expect from uploadable formatter code? What restrictions would be in place?

Hence the smiley in my original.

> How many X?

20 custom formatters. 2 translators. 60 categories. 509 fields. 422 criteria. 370 columns.

bruth commented 11 years ago

@murphyke thanks for the stats

bruth commented 11 years ago

@murphyke What is your overall assessment of the generality of the formatters? Are they very specific to the criterion concept they apply to? Which formatter has the most concepts associated with it? (Same questions for translators.)

bruth commented 11 years ago

Another aspect of defining/augmenting the concept-field relationship programmatically is the inability to represent fields that do not map to Django model fields, such as computed fields, e.g. class methods and properties that act on other data. For example, I am implementing the SIFT and PolyPhen2 formatters in Varify to include both the raw score and the prediction text, e.g. 'Damaging', 'Tolerated', etc. The prediction text is not a real column, but is computed. This clogs up the formatter a bit since the computed field needs to be defined and inserted in the formatter. The base Formatter class may be the appropriate place to put this logic, e.g. define additional fields or data prior to running it through the formatter methods. Just a thought.
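A rough sketch of what putting this logic in the base class could look like. The `computed` mapping and `preprocess` hook are hypothetical, not part of Avocado, and the `score` field name and the 0.05 cutoff are assumptions chosen for illustration rather than what Varify actually uses.

```python
class Formatter:
    """Stand-in base class; `computed` is a hypothetical hook, not Avocado's API."""

    # Maps a computed field name to a function of the raw values
    computed = {}

    def preprocess(self, values):
        # Copy so the raw values from the query are left untouched
        augmented = dict(values)
        for name, func in self.computed.items():
            augmented[name] = func(values)
        return augmented

    def __call__(self, values):
        return self.to_html(self.preprocess(values))

    def to_html(self, values):
        return ', '.join(str(v) for v in values.values())


def sift_prediction(values):
    # Assumed cutoff: scores at or below 0.05 are treated as damaging
    return 'Damaging' if values['score'] <= 0.05 else 'Tolerated'


class SiftFormatter(Formatter):
    computed = {'prediction': sift_prediction}

    def to_html(self, values):
        return '{} ({})'.format(values['score'], values['prediction'])
```

The subclass only declares the computed field and formats the result; inserting it into the values happens once, in the base class.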

leipzig commented 11 years ago

I think offering a more standardized, safe formatter syntax along with examples would be helpful; then you wouldn't have everyone inventing their own systems for representing intracell tables using columns with '|' or row breaks with '$' and then having to come up with CSS to address width issues.
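As one possible shape for such a shared helper, here is a minimal sketch that renders nested rows as an HTML table with every value escaped, instead of ad hoc '|' and '$' delimiters. The `cell_table` function and the `cell-table` CSS class name are made up for this example.

```python
from html import escape


def cell_table(rows):
    """Render rows (a list of lists) as a nested HTML table, escaping every value."""
    body = ''.join(
        '<tr>{}</tr>'.format(
            ''.join('<td>{}</td>'.format(escape(str(cell))) for cell in row)
        )
        for row in rows
    )
    return '<table class="cell-table">{}</table>'.format(body)
```

A formatter's `to_html` could return `cell_table(rows)` directly, and a single shared stylesheet rule for the class would cover the width issues.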

bruth commented 10 years ago

Similar to #88 in practice