RevolutionAnalytics / ravro

failing on complicated schemas #2

Open mattpollock opened 10 years ago

mattpollock commented 10 years ago

Hello,

I tested read.avro using a moderately complicated schema. Some fields contain sub-records; other fields contain arrays of records. One of the sub-records (named moments and containing mean, variance, skewness, and kurtosis fields) is defined in full the first time it appears and is referenced by name as a type after that. This does not cause Avro any problems, but read.avro throws the following error:

> dat <- read.avro(file="/path/to/file/part-r-00000.avro")
Error in (function (x, schema, flatten = T, simplify = F, encoded_unions = T,  : 
  Unsupported Avro type: moments

The schema of the file being read is (in part):

...
"fields" : [ {
    "name" : "routename",
    "type" : "string",
    "doc" : "path identifier indicates unique fix sequence"
  }, {
    "name" : "aircrafttype",
    "type" : "string"
  }, {
    "name" : "lowaltitudebin",
    "type" : "double",
    "doc" : "altitude [feet] at low end of route (rounded to nearest 1000ft)"
  }, {
    "name" : "highaltitudebin",
    "type" : "double",
    "doc" : "altitude [feet] at high end of route (rounded to nearest 1000ft)"
  }, {
    "name" : "route",
    "type" : [ "null", {
      "type" : "record",
      "name" : "routemetrics",
      "fields" : [ {
        "name" : "route",
        "type" : [ "string", "null" ]
      }, {
        "name" : "initialalttude",
        "type" : [ "null", {
          "type" : "record",
          "name" : "moments",
          "fields" : [ {
            "name" : "mean",
            "type" : "double"
          }, {
            "name" : "variance",
            "type" : "double"
          }, {
            "name" : "skewness",
            "type" : "double"
          }, {
            "name" : "kurtosis",
            "type" : "double"
          }, {
            "name" : "samplesize",
            "type" : "long"
          } ]
        } ],
        "doc" : "moments [feet] characterizing distribution of atltitudes at the beginning of the route (within given binning constraint)"
      }, {
        "name" : "terminalaltitude",
        "type" : [ "null", "moments" ],
        "doc" : "moments [feet] characterizing distribution of atltitudes at the end of the route (within given binning constraint)"
      }, {...

Note that moments is first defined as a type (as one branch of a union) in the initialalttude field, which is a field of the routemetrics record nested inside the top-level route field. After that, moments is referenced only by name, as in the subsequent terminalaltitude field.
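To make the define-once, reference-by-name pattern concrete, here is a minimal Python sketch of how a reader could resolve such references. It is not ravro's code or the Avro library's implementation: the function name resolve, the constant names, and the trimmed SCHEMA are hypothetical and for illustration only, and namespaces and aliases are ignored. The idea is to walk the schema in order, record each named type when its definition is encountered, and look up any non-primitive type string in that registry.

```python
# Hypothetical sketch: resolve Avro named-type references in a parsed JSON schema.
# Assumptions: no namespaces or aliases; names are defined before they are used.

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def resolve(schema, names=None):
    """Return a copy of `schema` with name references replaced by their definitions."""
    if names is None:
        names = {}
    if isinstance(schema, str):
        if schema in PRIMITIVES:
            return schema
        if schema in names:
            return names[schema]  # reference to a previously defined named type
        raise ValueError(f"Unsupported Avro type: {schema}")
    if isinstance(schema, list):  # a union: resolve each branch in order
        return [resolve(branch, names) for branch in schema]
    kind = schema["type"]
    if kind == "record":
        resolved = {"type": "record", "name": schema["name"], "fields": []}
        # Register before descending so later fields can refer to this record.
        names[schema["name"]] = resolved
        for field in schema["fields"]:
            resolved["fields"].append({**field, "type": resolve(field["type"], names)})
        return resolved
    if kind == "array":
        return {**schema, "items": resolve(schema["items"], names)}
    if kind == "map":
        return {**schema, "values": resolve(schema["values"], names)}
    if kind in ("enum", "fixed"):
        names[schema["name"]] = schema
        return schema
    # Wrapped form such as {"type": "string"}: resolve the inner type.
    return {**schema, "type": resolve(kind, names)}


# Trimmed version of the routemetrics record from the schema above.
SCHEMA = {
    "type": "record",
    "name": "routemetrics",
    "fields": [
        {"name": "initialalttude", "type": ["null", {
            "type": "record",
            "name": "moments",
            "fields": [
                {"name": "mean", "type": "double"},
                {"name": "variance", "type": "double"},
                {"name": "skewness", "type": "double"},
                {"name": "kurtosis", "type": "double"},
                {"name": "samplesize", "type": "long"},
            ],
        }]},
        {"name": "terminalaltitude", "type": ["null", "moments"]},
    ],
}

resolved = resolve(SCHEMA)
terminal = resolved["fields"][1]["type"][1]
print([f["name"] for f in terminal["fields"]])
```

With the registry in place, the "moments" string in terminalaltitude resolves to the same record definition that initialalttude introduced. Without it, the lookup fails in the same way as the error message above.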

Are there any plans to support schemas like the one above?

jamiefolson commented 10 years ago

If I recall correctly, we were primarily focused on Avro files with the schema embedded. In that case, at least for the data we tested, record schemas were duplicated everywhere they appear in the schema ("moments" would be defined in both places). It seems this is not the case for the schema metadata in your Avro files?

mattpollock commented 10 years ago

The data was generated using Pig. When I attempted to explicitly define moments throughout the schema, it threw errors (protecting against my giving the same name to different types of records, I think). I assumed that this was not unique to the Avro/Pig handshake, but a general Avro schema requirement. Perhaps that isn't the case.
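For reference, the Avro specification states that a schema may not contain multiple definitions of the same full name, which is consistent with the errors described above. The following Python sketch shows one way such a check could work. It is an illustration only, not Pig's or Avro's actual validation code: the function name find_redefinitions and the sample schemas are hypothetical, and namespaces are ignored.

```python
# Hypothetical sketch: detect named types that are defined more than once.
# Assumption: names are compared as plain strings, ignoring namespaces.

NAMED_KINDS = {"record", "enum", "fixed"}


def find_redefinitions(schema, seen=None):
    """Return names that are defined more than once, in the order encountered."""
    if seen is None:
        seen = set()
    dupes = []
    if isinstance(schema, list):  # union: check every branch
        for branch in schema:
            dupes += find_redefinitions(branch, seen)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if isinstance(kind, str) and kind in NAMED_KINDS:
            name = schema["name"]
            if name in seen:
                dupes.append(name)  # a second full definition of the same name
            else:
                seen.add(name)
        elif not isinstance(kind, str):
            # Wrapped form such as {"type": {...}}: check the inner schema.
            dupes += find_redefinitions(kind, seen)
        for field in schema.get("fields", []):
            dupes += find_redefinitions(field["type"], seen)
        for key in ("items", "values"):
            if key in schema:
                dupes += find_redefinitions(schema[key], seen)
    # Plain strings are primitives or references, never definitions.
    return dupes


MOMENTS = {
    "type": "record",
    "name": "moments",
    "fields": [{"name": "mean", "type": "double"}],
}

# moments defined once, then referenced by name.
defined_once = {"type": "record", "name": "routemetrics", "fields": [
    {"name": "initialalttude", "type": ["null", MOMENTS]},
    {"name": "terminalaltitude", "type": ["null", "moments"]},
]}

# moments written out in full in both places.
defined_twice = {"type": "record", "name": "routemetrics", "fields": [
    {"name": "initialalttude", "type": ["null", MOMENTS]},
    {"name": "terminalaltitude", "type": ["null", MOMENTS]},
]}

print(find_redefinitions(defined_once))
print(find_redefinitions(defined_twice))
```

The first schema matches what avro-tools getschema reports for the data file and passes the check; the second repeats the full definition and is flagged.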

Regardless, the schema I defined when saving the data and the schema returned by avro-tools getschema on a resulting data file (which is what I pasted above) are consistent: both define moments only once. This does not cause any hiccups for avro-tools tojson. Also, when experimenting with the Java API, calling fld.schema().getFields() (where fld is an object of type org.apache.avro.Schema.Field) on fields where moments is the type but is not explicitly defined (e.g., the terminalaltitude field above) returned the expected fields (mean, variance, etc.) without any problem.