azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0

"Literate" API for OSM data processing #103

Closed mojodna closed 5 years ago

mojodna commented 5 years ago

This begins the process of providing a more user-friendly API for processing OSM data. A set of discrete traits provides clear indicators of what data is available within a given Dataset, and the resulting types indicate which pre-defined actions are available.

The driving motivations are:

E.g.:

import osmesa.common._

// ??? stands in for the path to an OSM history ORC file
val historyDF = spark.read.orc(???)
val history: Dataset[OSM] with History = asHistory(historyDF)
val ways: Dataset[Way with Timestamp] with History = history.ways

implicit val nodes: Dataset[Node with Timestamp] with History = history.nodes

val geoms: Dataset[OSMFeature[jts.Geometry] with GeometryChanged with MinorVersion with Tags with Validity] with History =
  ways.withGeometry

I could use some feedback on how to break this up, particularly the traits, the implementations (for serialization), and the implicits (extensions). There are some name collisions that force imports to be more explicit than they should otherwise need to be; moving traits, implementations, etc. into their own packages / classes / objects will likely help.

Naming is hard. Ideas and suggestions for conventions are appreciated.

mojodna commented 5 years ago

For dynamic generation of Encoders within generic methods, I have:

import org.apache.spark.sql.{Encoder, Encoders}
import scala.reflect.api
import scala.reflect.runtime.universe._

private def buildEncoder[T](implicit tag: TypeTag[T]): Encoder[T] = {
  EncoderCache
    .getOrElseUpdate(
      tag, {
        val pkg = "osmesa.common.impl"

        val traits = traitsIn[T]

        // e.g. Node with Validity -> "osmesa.common.impl.NodeWithValidity"
        val name = s"$pkg.${traits.map(_.name.toString).mkString("With")}"

        // https://stackoverflow.com/a/23792152/507685
        val c = try {
          Class.forName(name) // obtain java.lang.Class object from a string
        } catch {
          case e: ClassNotFoundException =>
            throw new RuntimeException(
              s"${e.getMessage} must be an implementation of the following traits: ${traits
                .map(_.name.toString)
                .mkString(", ")}")
          case e => throw e
        }

        val mirror = runtimeMirror(c.getClassLoader) // obtain runtime mirror
        val sym = mirror.staticClass(name) // obtain class symbol for `c`
        val tpe = sym.selfType // obtain type object for `c`

        // create a type tag which contains the above type object
        val targetType = TypeTag(
          mirror,
          new api.TypeCreator {
            def apply[U <: api.Universe with Singleton](m: api.Mirror[U]): U#Type =
              if (m == mirror) tpe.asInstanceOf[U#Type]
              else
                throw new IllegalArgumentException(
                  s"Type tag defined in $mirror cannot be migrated to other mirrors.")
          }
        ).asInstanceOf[TypeTag[Product]]

        Encoders.product(targetType)
      }
    )
    .asInstanceOf[Encoder[T]]
}

// determine the closest (most-derived) traits for T
def traitsIn[T](implicit tag: TypeTag[T]): Seq[TypeSymbol] = {
  val tpe = tag.tpe

  // all abstract base classes (i.e. traits) other than Any
  val t = tpe.baseClasses.filter(s => s.isAbstract && s != typeOf[Any].typeSymbol).map(_.asType)

  t.foldLeft(Seq.empty[TypeSymbol]) {
      case (acc, x) => {
        // if x is a super type of anything in acc, skip it
        if (acc.exists(y => y.toType <:< x.toType)) {
          acc
        } else {
          // filter out anything in acc that's a super type of x
          acc.filterNot(y => x.toType <:< y.toType) :+ x
        }
      }
    }
    .sortBy(_.toString) // alphabetical order matters: it determines the impl class name
    .distinct
}
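
buildEncoder refers to an EncoderCache that isn't shown above; presumably it's a mutable (ideally thread-safe) map keyed by TypeTag so that the reflective lookup and Encoder construction happen only once per requested type. A minimal sketch of what it might look like (the shape is an assumption, not the actual OSMesa code):

import org.apache.spark.sql.Encoder
import scala.collection.concurrent.TrieMap
import scala.reflect.runtime.universe.TypeTag

object EncoderCacheSketch {
  // assumed shape: values are stored as Encoder[_] and cast back to
  // Encoder[T] by buildEncoder after the getOrElseUpdate call
  val EncoderCache: TrieMap[TypeTag[_], Encoder[_]] =
    TrieMap.empty[TypeTag[_], Encoder[_]]
}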

This assumes that a matching osmesa.common.impl.<case class> exists (since case classes can't be created at runtime) and that it is named after the alphabetically-sorted trait names.
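
For illustration, a minimal sketch of what such a pre-generated implementation might look like (the field names and types here are assumptions rather than the actual OSMesa schema; the important part is that a concrete Product type exists at the predictable name so that Class.forName and Encoders.product succeed):

package osmesa.common.impl

import java.sql.Timestamp

// hypothetical fields; the real implementation would also extend
// osmesa.common.traits.Node with osmesa.common.traits.Validity
case class NodeWithValidity(
    id: Long,
    tags: Map[String, String],
    lat: Option[BigDecimal],
    lon: Option[BigDecimal],
    changeset: Long,
    updated: Timestamp,
    validUntil: Option[Timestamp],
    visible: Boolean
)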

This then facilitates:

implicit class HistoricalNodeWithTimestampDatasetExtension[T <: Node with Timestamp](
    history: Dataset[T] with History) {
  import history.sparkSession.implicits._

  def withValidity[U >: Node with Validity](implicit tag: TypeTag[U]): Dataset[U] with History = {
    implicit val encoder: Encoder[U] = buildEncoder[U]

    history.withValidityInternal
      .as[U]
      .asInstanceOf[Dataset[U] with History]
  }
}

The case class NodeWithValidity will be used under the hood by Spark.

The goal is to define implicit classes with the core required fields as type parameters and allow them to be used by additionally refined types, e.g., Node with GeometryChanged with Timestamp, with similarly-enhanced return types.

I.e.:

Node with Timestamp → Node with Validity
Node with GeometryChanged with Timestamp → Node with GeometryChanged with Validity

I think changing the class signature to this is part of the equation:

implicit class HistoricalNodeWithTimestampDatasetExtension[T <: Timestamp](
    history: Dataset[Node with T] with History)

I'm running into a type-related problem: I want the return type of a method to be a subset of the class's type parameters, i.e. Node with __ with Timestamp → Node with __ with Validity, or Node with __ with Validity → Point with __ with Validity.

Is this even possible without needing to write boilerplate for each combination?

mojodna commented 5 years ago

This compiles (and helps, but doesn't totally eliminate the boilerplate):

package osmesa.common
import org.apache.spark.sql.Dataset
import osmesa.common.traits.{GeometryChanged, Node, Point, Validity}

import scala.reflect.runtime.universe.TypeTag

object Scratch extends App {
  implicit class ValidityPreservingDatasetExtension[T](ds: Dataset[Node with T])(
      implicit evidence: T <:< Validity) {
    def asPoints[R >: Point with T](implicit tag: TypeTag[R]): Dataset[R] = ???
  }

  implicit class GCwV(ds: Dataset[Node with GeometryChanged with Validity])
      extends ValidityPreservingDatasetExtension[GeometryChanged with Validity](ds)

  val f: Dataset[Point with GeometryChanged with Validity] =
    ???.asInstanceOf[Dataset[Node with Validity with GeometryChanged]].asPoints
  val g: Dataset[Point with Validity] = ???.asInstanceOf[Dataset[Node with Validity]].asPoints
}

However, IntelliJ's parser doesn't recognize this as valid:

[screenshot: IntelliJ reporting the type mismatch described below]

implicit class GCwV(ds: Dataset[Node with GeometryChanged with Validity]) extends ValidityPreservingDatasetExtension[GeometryChanged with Validity](ds) reports: "Type mismatch, expected: org.apache.spark.sql.Dataset[osmesa.common.traits.Node with osmesa.common.traits.GeometryChanged with osmesa.common.traits.Validity], actual: org.apache.spark.sql.Dataset[osmesa.common.traits.Node with osmesa.common.traits.GeometryChanged with osmesa.common.traits.Validity]" (yes, they're the same). I think the Dataset[Node with T] constructor param is causing the problem.

???.asInstanceOf[Dataset[Node with Validity with GeometryChanged]].asPoints can't resolve asPoints, presumably because of the first error (in the editor only; it compiles fine).

Since a large part of the reason to clean up the API is to make exploring possibilities work through auto-complete, this is a bit of a bummer...

Thoughts?

mojodna commented 5 years ago

Minimized repro, triggered by the container type (reporting it to JetBrains):

trait Container[T]
trait Concrete
trait A
trait B

class Parent[T](t: Container[Concrete with T])
class Child(t: Container[Concrete with A with B]) extends Parent[A with B](t)

mojodna commented 5 years ago

https://youtrack.jetbrains.net/issue/SCL-14527

mojodna commented 5 years ago

Potential partial workaround:

trait Container[T]
trait Concrete
trait A
trait B

type AB = A with B // (in real code the alias must live inside an object, trait, or package object)

class Parent[T](t: Container[Concrete with T])
class Child(t: Container[Concrete with AB]) extends Parent[A with B](t)

mojodna commented 5 years ago

In practice, I never actually triggered the IntelliJ bug in d45add5. Using additional type parameters helped things dramatically; when extending one of the functionality-adding traits in an implicit extension class, one can include a base type for what's returned. If it needs to be more specific, a new, more specific extension class can be created.
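
As a rough illustration of that pattern (the names below are illustrative stand-ins, not the actual d45add5 code): the functionality-adding trait carries a type parameter for the refinement it returns, the generic extension class picks a sensible base refinement, and a more specific extension class preserves extra refinements when needed.

object ExtensionPatternSketch {
  // stand-ins for the real osmesa.common types; illustrative only
  trait Container[T]
  trait Node
  trait Validity
  trait GeometryChanged

  // functionality-adding trait: R is the refinement type it returns
  trait WithValidity[R] {
    def withValidity: Container[R] = ???
  }

  // generic extension: a plain Node container yields Node with Validity
  implicit class NodeExtension(ds: Container[Node])
      extends WithValidity[Node with Validity]

  // more specific extension: the extra refinement is preserved in the result
  implicit class GeometryChangedNodeExtension(ds: Container[Node with GeometryChanged])
      extends WithValidity[Node with GeometryChanged with Validity]

  def a(ds: Container[Node]): Container[Node with Validity] = ds.withValidity
  def b(ds: Container[Node with GeometryChanged]): Container[Node with GeometryChanged with Validity] =
    ds.withValidity
}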

mojodna commented 5 years ago

To use this in a Spark REPL:

sbt "project common" assembly
spark-shell --jars common/target/scala-2.11/osmesa-common-assembly-0.1.0.jar
import osmesa.common._
import osmesa.common.implicits._

val orc = spark.read.orc("common/src/test/resources/disneyland.osh.orc")
val osm = asHistory(orc)

osm.nodes
// org.apache.spark.sql.Dataset[osmesa.common.traits.Node with osmesa.common.traits.Timestamp] = [id: bigint, tags: map<string,string> ... 8 more fields]

// press Tab after the dot to explore the available methods via auto-complete
osm.⇥

mojodna commented 5 years ago

This didn't work out as hoped. The combination of Spark Dataset capabilities + Scala's type system made it look like we'd be able to implement a form of granular lenses on top of the OSM data model while adding some level of type safety to functions that accept data in varying forms as input. However, the further we got into this, the more the plumbing got in the way and introduced unrelated complexity. Additionally, dynamically subtracting traits from a list of refinements (within the type system) proved to be impossible. Without that (even accepting underlying complexity), the boilerplate burden was just too high.

I / we did learn a whole lot from the process, much of which has already been merged into OSMesa in various forms.