datamininggroup / pfa

Portable Format for Analytics
http://dmg.org/pfa/

Filtering rows #2

Closed wwells closed 8 years ago

wwells commented 8 years ago

Hi all, thanks for the telephone conference, I think it was a good start. I asked the question regarding filtering rows because I was dealing with that when I rewrote our Missing Value node in KNIME. It now supports PMML, but there you don't really have the notion of removing rows from your input. So while replacing missing values with other values is easy to specify in PMML, saying "Remove any row that contains a missing value in column X" is not possible, as far as I know. There are other nodes that also filter rows in KNIME, so having this in PFA would in my opinion be really helpful. Regards, Alexander

jpivarski commented 8 years ago

In PFA there are two ways to skip a row (where I take a "row" to be one instance of the input schema): as an error and as a missing output.

A typical PFA implementation will catch all PFA runtime exceptions (there are only three types: library exception, user-defined exception, and timeout exception) and move on, possibly logging the error. However, how runtime errors are handled is outside the scope of the PFA language, so the missing-output method is more broadly applicable.

Here's an example of the error-based method. The input has three fields, x, y, and z, where x could possibly be missing (which is why its type is a union including null). The scoring engine adds x and y, but this operation could be replaced by any data mining model.

{"input": {"type": "record", "name": "Input", "fields": [
    {"name": "x", "type": ["null", "int"]},
    {"name": "y", "type": "double"},
    {"name": "z", "type": "string"}]},
 "output": "double",
 "action":
     {"ifnotnull": {"x": "input.x"},
      "then": {"+": ["x", "input.y"]},
      "else": {"error": "x is missing", "code": -123}}
 }

If you give this scoring engine {"x": {"int": 3}, "y": 3.14, "z": "hello"} as input, it returns 6.14 as output. If you give it {"x": null, "y": 3.14, "z": "hello"} as input, it raises a user-defined exception with message "x is missing" and optional code -123 (which your PFA implementation can decide what to do with).

Here's an example of the missing-output method.

{"input": {"type": "record", "name": "Input", "fields": [
    {"name": "x", "type": ["null", "int"]},
    {"name": "y", "type": "double"},
    {"name": "z", "type": "string"}]},
 "output": ["null", "double"],
 "action":
     {"ifnotnull": {"x": "input.x"},
      "then": {"+": ["x", "input.y"]},
      "else": null}
 }

Notice that the output must now be ["null", "double"] (enforced by the type-checker).

If you give this scoring engine {"x": {"int": 3}, "y": 3.14, "z": "hello"} as input, it returns {"double": 6.14} as output (that's the proper JSON representation of an instance of a tagged union). If you give it {"x": null, "y": 3.14, "z": "hello"} as input, it returns the missing value null as output. Your PFA implementation or next stage of processing has to deal with the fact that the output can contain missing values.

The "ifnotnull" special form returns the narrowest possible supertype of the "then" branch type and the "else" branch type. In both examples above, the then branch has type "double". In the error-based method, the "else" branch has a bottom type, a type that cannot have any return value, and the narrowest possible supertype of that with "double" is just "double" (like adding zero). In the missing-output method, the "else" branch has Avro "null" type, a type that can have exactly one return value (null), and the narrowest possible supertype of that with "double" is ["null", "double"].

For convenience, the "ifnotnull" form can project types for multiple variables at once. For instance, if x, y, and z were all nullable with an input schema like

{"type": "record", "name": "Input", "fields": [
     {"name": "x", "type": ["null", "int"]},
     {"name": "y", "type": ["null", "double"]},
     {"name": "z", "type": ["null", "string"]}]}

then you could write "ifnotnull": {"x": "input.x", "y": "input.y", "z": "input.z"} to create variables x (type "int"), y (type "double"), and z (type "string") in a single "then" branch without deep nesting.
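
Putting that together with the missing-output pattern above, a minimal sketch (the addition of x and y again stands in for any model, and z is bound but unused here):

{"input": {"type": "record", "name": "Input", "fields": [
    {"name": "x", "type": ["null", "int"]},
    {"name": "y", "type": ["null", "double"]},
    {"name": "z", "type": ["null", "string"]}]},
 "output": ["null", "double"],
 "action":
     {"ifnotnull": {"x": "input.x", "y": "input.y", "z": "input.z"},
      "then": {"+": ["x", "y"]},
      "else": null}
 }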

Also, the impute.* library methods work uniformly on variables of type ["null", X] for any X, so they can be applied to the output as easily as to fields of the input.
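
For instance, here's a sketch that imputes rather than skips, using impute.defaultOnNull (which takes a nullable value and a default of the same base type) to substitute 0 when x is missing:

{"input": {"type": "record", "name": "Input", "fields": [
    {"name": "x", "type": ["null", "int"]},
    {"name": "y", "type": "double"},
    {"name": "z", "type": "string"}]},
 "output": "double",
 "action":
     {"+": [{"impute.defaultOnNull": ["input.x", 0]}, "input.y"]}
 }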

AlexanderFillbrunn commented 8 years ago

For scorers that produce a single value, returning null may be ambiguous, as it could mean either a missing value or a row to be skipped. When returning a record, which will probably be the case in KNIME, returning null instead of a record would be OK, though. Thanks for the quick response!

jpivarski commented 8 years ago

There's a third way, and this is probably the most appropriate (I don't know why I forgot about it yesterday).

The two examples I described above are "method": "map" engines (the default): they produce one output for each input. A "method": "emit" engine produces zero or more outputs for each input. This "emit" is how we filter input in map-reduce jobs (so I have no idea why I forgot about it yesterday, because it's exactly relevant to your question).

Here is a PFA scoring engine that emits a result whenever field x is not missing:

{"input": {"type": "record", "name": "Input", "fields": [
    {"name": "x", "type": ["null", "int"]},
    {"name": "y", "type": "double"},
    {"name": "z", "type": "string"}]},
 "output": "double",
 "method": "emit",
 "action":
     {"ifnotnull": {"x": "input.x"},
      "then": {"emit": {"+": ["x", "input.y"]}}}
 }

In emit-type engines, the last expression in the "action" is ignored, and there are no constraints on the return type of the "action". (In the above case, its type is "null" because there's no "else" clause.) What matters instead is the type of the object passed to the "emit" function: it must be "double" because the "output" type is "double".

The PFA implementation basically has two ways to deal with this: (1) provide a callback function to use as emit, or (2) collect all emitted results into some sort of list and flat-map the lists into one big list. A filter emits either zero or one output per input, so these partial lists would either be empty or contain one object, and the number of outputs would be less than or equal to the number of inputs.

If you're coming from the SQL world (which might have motivated this question), "method": "map" creates simple functions like "sqrt", "method": "emit" creates table-generating functions like "explode", and "method": "fold" creates aggregation functions like "sum" or "avg".
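
For concreteness, a minimal sketch of a summing fold engine (the running result is available in the action as tally, it starts at the value of "zero", and "merge" describes how to combine two partial tallies):

{"input": "double",
 "output": "double",
 "method": "fold",
 "zero": 0,
 "action": {"+": ["input", "tally"]},
 "merge": {"+": ["tallyOne", "tallyTwo"]}}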

AlexanderFillbrunn commented 8 years ago

Hi Jim, thanks for taking the time to respond so thoroughly! I came across emit today and wanted to ask you whether that would be the perfect choice for doing filtering. Much more convenient than the derived fields of PMML!

jpivarski commented 8 years ago

If I had been thinking properly yesterday, emit is the only answer I would have given. However, the other methods (skipping via exceptions and outputs with missing values) are semantically distinct ways of doing roughly the same thing.