disneystreaming / smithy4s

https://disneystreaming.github.io/smithy4s/
Other
339 stars 68 forks source link

Schema instantiation hangs #537

Closed kubukoz closed 6 months ago

kubukoz commented 1 year ago

Couldn't reproduce it simpler, so here's a ZIO test:

//> using scala "2.13.10"
//> using lib "com.disneystreaming.smithy4s::smithy4s-json:0.16.6"
//> using lib "dev.zio::zio-test:1.0.17"
//> using lib "dev.zio::zio-test-sbt:1.0.17"

import smithy4s.schema.Schema
import smithy.api.JsonName
import smithy4s.api.Discriminated
import smithy4s.http.CodecAPI
import smithy4s.HintMask
import smithy4s.IntEnum
import smithy4s.http.json.JCodec

import zio.test._
import zio.test.Assertion._
import smithy4s.schema.Schema
import smithy.api.JsonName
import smithy4s.api.Discriminated
import zio.test._
import zio.test.Assertion._
import zio.duration._

object JsonSpec extends DefaultRunnableSpec {

  private val codecs = smithy4s.http.json.codecs(HintMask(IntEnum))

  private def encodeString[A](a: A)(schema: Schema[A]): String = {
    val c = codecs.compileCodec(schema)

    new String(codecs.writeToArray(c, a))
  }

  def spec: ZSpec[Environment, Failure] = suite("Command JSON codec")(
    test("struct schema serialization ignores @jsonName") {
      val schema = Schema
        .struct[String]
        .apply(
          Schema.string
            .addHints(JsonName("b"))
            .required[String]("a", identity)
        )(identity)

      assert(encodeString("test")(schema))(equalTo("""{"a":"test"}"""))
    },
    test("unions are always encoded as tagged") {

      val schema = {
        val i = Schema.int.oneOf[Either[Int, String]]("intKey", Left(_))
        val s = Schema.string.oneOf[Either[Int, String]]("stringKey", Right(_))

        Schema
          .union[Either[Int, String]](i, s)(_.fold(i(_), s(_)))
          .addHints(Discriminated("type"))
      }

      assert(encodeString[Either[Int, String]](Left(42))(schema))(
        equalTo("""{"intKey":42}""")
      )
    }
  ) @@ TestAspect.timeout(2.seconds)

}

Feel free to bump the timeout to 10 seconds or more, chances are you'll hit the timeout more often than not - the spec usually doesn't finish.

I bisected this down to 80a5d912ca101266795f5a425d855c3b1ea04f6a.

kubukoz commented 1 year ago

I'm aware that the above is calling the unary overload of compileCodec:

image

and it seems like that implementation creates a new cache for each compilation... so I'm not sure why any concurrent usage (confirmed that the tests run concurrently) would break this.

This suffers from the same problem:

private val cache = codecs.createCache()

private def encodeString[A](a: A)(schema: Schema[A]): String = {
  val c = codecs.compileCodec(schema, cache)

  new String(codecs.writeToArray(c, a))
}
Baccata commented 1 year ago

It's not even compilation that seems to hang, the following is sufficient to make it hang (on my machine anyway):

//> using scala "2.13.10"
//> using lib "com.disneystreaming.smithy4s::smithy4s-core:dev-SNAPSHOT"
//> using lib "dev.zio::zio-test:1.0.17"
//> using lib "dev.zio::zio-test-sbt:1.0.17"

import smithy4s.schema.Schema
import smithy.api.JsonName
import smithy4s.api.Discriminated
import zio.test._
import zio.test.Assertion._
import zio.duration._

object JsonSpec extends DefaultRunnableSpec {

  def spec: ZSpec[Environment, Failure] = suite("Command JSON codec")(
    test("struct schema serialization ignores @jsonName") {
      val schema = Schema
        .struct[String]
        .apply(
          Schema.string
            .addHints(JsonName("b"))
            .required[String]("a", identity)
        )(identity)

      assertTrue(true)
    },
    test("unions are always encoded as tagged") {
      val schema = {
        val i = Schema.int.oneOf[Either[Int, String]]("intKey", Left(_))
        val s = Schema.string.oneOf[Either[Int, String]]("stringKey", Right(_))

        Schema
          .union[Either[Int, String]](i, s)(_.fold(i(_), s(_)))
          .addHints(Discriminated("type"))
      }
      assertTrue(true)
    }
  ) @@ TestAspect.timeout(2.seconds)

}

Removing the addHints on either test stops the hanging. I think it may be a classloading issue : having the schemas instantiated within the tests (that run in parallel) hit the classloader concurrently.

Moving the schemas away from the tests fix it too.

kubukoz commented 1 year ago

adding hints causing something to break depending on order reminds me of https://github.com/disneystreaming/smithy4s/pull/385...

any clue why concurrent classloading would be an issue here? It's strange because it wasn't a problem in previous releases (0.16.4 and earlier)

Baccata commented 1 year ago

any clue why concurrent classloading would be an issue here

possibly due to how NewType are encoded : each newtype companion object has a couple internal classes. Anyhow, without profiling it'll be hard to guess.

kubukoz commented 1 year ago

btw. I'm not able to reproduce the hang with your last snippet (replaced the version with 0.16.6)

Baccata commented 1 year ago

Can you extract the schema values away from the tests and see whether it has an effect ? (using your snippet that is)

kubukoz commented 1 year ago

it does help :)

kubukoz commented 1 year ago

I'll pay good money for someone to fix this.

Baccata commented 1 year ago

How much ?

kubukoz commented 1 year ago

like, $200?

kubukoz commented 8 months ago

Guess what, this is not a problem in 0.18.

kubukoz commented 8 months ago

update: after hours of minimization of an internal reproduction, narrowed it down to:

namespace demo

@pattern("^[a-zA-Z0-9]{8,12}$")
string MyStr

enum MyEnum {
  VALUEA
}
package demo

object B extends weaver.FunSuite {

  val dummyEnum =
    MyEnum.schema.required[Unit]("foo", _ => ???)

}

object A extends weaver.FunSuite {

  val dummyStr =
    MyStr("123456789")

}

After this, in order to eliminate codegen, I merged everything into one file: this scala-cli now reproduces the hang.

Update: eliminated other dependencies.

//> using scala "3.3.1"
//> using lib "com.disneystreaming.smithy4s::smithy4s-core:0.18.3"
package demo

import smithy4s.Enumeration
import smithy4s.Hints
import smithy4s.Newtype
import smithy4s.Schema
import smithy4s.ShapeId
import smithy4s.ShapeTag
import smithy4s.schema.EnumTag
import smithy4s.schema.Schema.bijection
import smithy4s.schema.Schema.enumeration
import smithy4s.schema.Schema.string

import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

object MyStr extends Newtype[String] {
  val id: ShapeId = ShapeId("demo", "MyStr")
  val hints: Hints = Hints.empty

  val underlyingSchema: Schema[String] = string
    .withId(id)
    .addHints(hints)
    .validated(smithy.api.Pattern(s"^[a-zA-Z0-9]{8,12}$$"))

  // this was just Schema[MyStr] but the package object can't live in this file so I inlined it
  implicit val schema: Schema[MyStr.Type] = bijection(underlyingSchema, asBijection)
}

sealed abstract class MyEnum(_value: String, _name: String, _intValue: Int, _hints: Hints)
  extends Enumeration.Value {
  override type EnumType = MyEnum
  override val value: String = _value
  override val name: String = _name
  override val intValue: Int = _intValue
  override val hints: Hints = _hints
  override def enumeration: Enumeration[EnumType] = MyEnum
  @inline final def widen: MyEnum = this
}

object MyEnum extends Enumeration[MyEnum] with ShapeTag.Companion[MyEnum] {
  val id: ShapeId = ShapeId("demo", "MyEnum")

  val hints: Hints = Hints.empty

  case object VALUEA extends MyEnum("VALUEA", "VALUEA", 0, Hints())

  val values: List[MyEnum] = List(
    VALUEA
  )

  val tag: EnumTag[MyEnum] = EnumTag.ClosedStringEnum
  implicit val schema: Schema[MyEnum] = enumeration(tag, values).withId(id).addHints(hints)
}

object B {
  val dummyEnum = MyEnum.schema.required[Unit]("foo", _ => ???)
}

object A {
  val dummyStr = MyStr("123456789")
}

object demo {

  def main(args: Array[String]): Unit = {
    val cl = new CountDownLatch(2)

    def t(f: => Unit) = {
      val thread = new Thread(() => f)
      thread.setDaemon(true)
      thread.start()
    }

    t {
      A.toString()
      cl.countDown()
    }

    t {
      B.toString()
      cl.countDown()
    }

    require(cl.await(2, TimeUnit.SECONDS), "TIMED OUT")
  }

}
kubukoz commented 8 months ago

The snippet above doesn't fail (due to a presumed deadlock) on Scala 2.13 - worth noting Dotty has a new implementation of lazy vals: https://docs.scala-lang.org/scala3/reference/changed-features/lazy-vals-init.html

Baccata commented 8 months ago

update: after hours of minimization of an internal reproduction, narrowed it down to:

We already had a pretty good minimisation, and it's very possible your new one has just unveiled something specific to enumerations, which would be a different issue. Can you remove the Enumeration in favour of another type of schema that wouldn't involve lazy vals ?

I think the initial repro hangs because the Documentation type has a Documentation hint, so there's probably a cycle in the classloading that make it subject to deadlocks

kubukoz commented 8 months ago

The earlier minimization didn't work on the latest version, hence me trying to do it again - this one came from a real app.

has just unveiled something specific to enumerations, which would be a different issue

I minimized it further from the generated code:

//> using scala "3.3.1"
//> using lib "com.disneystreaming.smithy4s::smithy4s-core:0.18.3"
//> using option "-Wunused:imports"
package demo

import smithy4s.Schema

import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

object MyStr {
  val schema = Schema.int.addHints(smithy.api.Length())
}

object MyEnum {
  val schema = Schema.int.addHints(smithy.api.Required())
}

object B {
  val dummyEnum = MyEnum.schema.toString()
}

object A {
  val dummyStr = MyStr.toString()
}

object demo {

  def main(args: Array[String]): Unit = {
    val cl = new CountDownLatch(2)

    def t(f: => Unit) = {
      val thread = new Thread(() => f)
      thread.setDaemon(true)
      thread.start()
    }

    t {
      A.hashCode()
      cl.countDown()
    }

    t {
      B.hashCode()
      cl.countDown()
    }

    require(cl.await(2, TimeUnit.SECONDS), "TIMED OUT")
  }

}

and it does seem to be Documentation-related. I have yet to confirm if it's classloading or lazy val initialization (or both).

kubukoz commented 8 months ago

YourKit detected a potential deadlock:

Potential deadlock: frozen threads found

It seems that the following threads have not changed their stack for more than 10 seconds.
These threads are possibly (but not necessarily!) in a deadlock or hung.

+----------------------------------------------------------------------------------------------------------------------+
|                                                         Name                                                         |
+----------------------------------------------------------------------------------------------------------------------+
|  +---Thread-0 Frozen for at least 10s <Ignore a false positive>                                                      |
|  | |                                                                                                                 |
|  | +---smithy4s.schema.Field$.required(String, Schema, Function1) Field.scala:85                                     |
|  | |                                                                                                                 |
|  | +---smithy4s.schema.Schema$PartiallyAppliedRequired$.apply$extension(Schema, String, Function1) Schema.scala:271  |
|  | |                                                                                                                 |
|  | +---smithy.api.TraitDiffRule$.<clinit>() TraitDiffRule.scala:29                                                   |
|  | |                                                                                                                 |
|  | +---smithy.api.Trait$.<clinit>() Trait.scala:28                                                                   |
|  | |                                                                                                                 |
|  | +---smithy.api.Documentation$.<clinit>() Documentation.scala:16                                                   |
|  | |                                                                                                                 |
|  | +---smithy.api.Length$.<clinit>() Length.scala:23                                                                 |
|  | |                                                                                                                 |
|  | +---demo.MyStr$.<clinit>() main.scala:12                                                                          |
|  | |                                                                                                                 |
|  | +---demo.A$.<clinit>() main.scala:24                                                                              |
|  | |                                                                                                                 |
|  | +---demo.demo$.main$$anonfun$1(CountDownLatch) main.scala:39                                                      |
|  | |                                                                                                                 |
|  | +---demo.demo$.main$$anonfun$adapted$1(CountDownLatch) main.scala:41                                              |
|  | |                                                                                                                 |
|  | +---demo.demo$$$Lambda$1.0x0000007001001000.apply()                                                               |
|  | |                                                                                                                 |
|  | +---scala.Function0.apply$mcV$sp() Function0.scala:42                                                             |
|  | |                                                                                                                 |
|  | +---demo.demo$.$anonfun$1(Function0) main.scala:33                                                                |
|  | |                                                                                                                 |
|  | +---demo.demo$$$Lambda$2.0x00000070010012c8.run()                                                                 |
|  | |                                                                                                                 |
|  | +---java.lang.Thread.run() Thread.java:833                                                                        |
|  |                                                                                                                   |
|  +---Thread-1 Frozen for at least 10s <Ignore a false positive>                                                      |
|    |                                                                                                                 |
|    +---smithy.api.Required$.<clinit>() Required.scala:18                                                             |
|    |                                                                                                                 |
|    +---demo.MyEnum$.<clinit>() main.scala:16                                                                         |
|    |                                                                                                                 |
|    +---demo.B$.<clinit>() main.scala:20                                                                              |
|    |                                                                                                                 |
|    +---demo.demo$.main$$anonfun$2(CountDownLatch) main.scala:44                                                      |
|    |                                                                                                                 |
|    +---demo.demo$.main$$anonfun$adapted$2(CountDownLatch) main.scala:46                                              |
|    |                                                                                                                 |
|    +---demo.demo$$$Lambda$3.0x00000070010014f8.apply()                                                               |
|    |                                                                                                                 |
|    +---scala.Function0.apply$mcV$sp() Function0.scala:42                                                             |
|    |                                                                                                                 |
|    +---demo.demo$.$anonfun$1(Function0) main.scala:33                                                                |
|    |                                                                                                                 |
|    +---demo.demo$$$Lambda$2.0x00000070010012c8.run()                                                                 |
|    |                                                                                                                 |
|    +---java.lang.Thread.run() Thread.java:833                                                                        |
+----------------------------------------------------------------------------------------------------------------------+

and from the looks of it, it's Required on the surface (Documentation appears in the stack too though)

kubukoz commented 8 months ago

update! Native Image also suffers the same outcome. Wouldn't this exclude classloading?

kubukoz commented 8 months ago

Just had a chat with @Baccata about this.

We're going with the assumption that this is generally caused by the combination of:

Whether it's caused by the encoding of lazy vals alone (which I suspect) or classloading (which I wasn't able to confirm with a profiler), the solution will likely require us to get rid of the recursive initialization, rather than introduce more laziness anywhere - at this point it'd just risk bumping into other similar problems down the line, perhaps even past bootstrapping time (which would be terrible for applications, imagine randomly hanging while handling a request).

We'll proceed with an experiment of the following: generate DynamicBindings for hint values that include a recursive reference. I will do a POC of this by hardcoding this behavior to the relevant hints (Documentation and Trait for starters).

In short, this will apply a change like:

object Documentation extends Newtype[String] {
  val id: ShapeId = ShapeId("smithy.api", "documentation")
  val hints: Hints = Hints(
-    smithy.api.Documentation("Adds documentation to a shape or member using CommonMark syntax."),
+    ShapeId("smithy.api", "documentation") -> Document.fromString("Adds documentation to a shape or member using CommonMark syntax."),
    smithy.api.Trait(selector = None, structurallyExclusive = None, conflicts = None, breakingChanges = None),
  )
  val underlyingSchema: Schema[String] = string.withId(id).addHints(hints)
  implicit val schema: Schema[Documentation] = recursive(bijection(underlyingSchema, asBijection))
}

which goes into the DynamicBinding code path, currently used (primarily? exclusively?) for DynamicSchemaIndex.

kubukoz commented 8 months ago

The experiment seems to have been successful:

  val hints: Hints = Hints(
    ShapeId("smithy.api", "documentation") -> Document.nullDoc,
    ShapeId("smithy.api", "trait") -> Document.nullDoc
  )

with these, there's no longer a hang (ran the code 100 times). I applied the following simplifications:

kubukoz commented 8 months ago

Here's where it actually stops (both threads in RUNNING state):

image image

which shows how Documentation is involved: Thread 1 is trying to evaluate the Hints of Required (one of which is Documentation), and Thread 0 is in the middle of Documentation's <clinit>, getting stuck presumably before it can call Required.<clinit>.

Given <clinit> is the initializer of the class, I suppose it's technically correct that this is all due to lazy classloading.

kubukoz commented 6 months ago

OK @baccata, I think it's time to make a choice...

The options attempted are:

The "smart" and complex one: #1295

We detect recursive traits and make the appropriate hints use dynamic bindings.

Hints.lazily: #1326

We wrap all Hints(...) with Hints.lazily(=>Hints).

Lazy static bindings

I like the last option the most, but Hints.lazily also sounds good.

Baccata commented 6 months ago

So far I liked option 3 best, and I think it's fine to target 0.19. I think we ought to measure the impact on memory / compilation time though, before making a final decision, to ensure the overhead is not dramatic.

channingwalton commented 6 months ago

Just in case it helps, I believe I hit the same issue: running tests in parallel resulting in everything hanging. A thread dump revealed various combinations of the following:

"io-compute-6" #722 daemon prio=5 os_prio=31 cpu=50.44ms elapsed=1099.03s tid=0x000000013732fc00 nid=0x16e03 in Object.wait()  [0x00000002de0b8000]
   java.lang.Thread.State: RUNNABLE
        at smithy.api.Documentation$.<clinit>(Documentation.scala:16)
        - waiting on the Class initialization monitor for smithy.api.Trait$
…
"io-compute-2" #718 daemon prio=5 os_prio=31 cpu=35.78ms elapsed=1099.03s tid=0x00000001374fe200 nid=0x1ed03 in Object.wait()  [0x00000002dc087000]
   java.lang.Thread.State: RUNNABLE
        at smithy.api.Trait$.<clinit>(Trait.scala:27)
        - waiting on the Class initialization monitor for smithy.api.Documentation$
kubukoz commented 6 months ago
import cats.effect.IO
import cats.effect.IOApp
import cats.implicits._
import {<confidential> => SimpleRestJsonBuilder}
import org.http4s.ember.server.EmberServerBuilder
import smithy4s.kinds.Kind1
import smithy4s.kinds.Kind5
import smithy4s.kinds.PolyFunction5
import smithy4s.schema.Schema
import smithy4s.schema.Schema.RefinementSchema

import scala.io.StdIn

object MyMain extends IOApp.Simple {

  def measured[A](tag: String)(a: => A): A = {
    val start  = System.nanoTime()
    val result = a
    val end    = System.nanoTime()
    println(s"$tag - Time: ${(end - start) / 1000000}ms")
    result
  }

  measured("init") {
    println(ServiceTags.All.size)
  }

  val allRoutes =
    measured("allRoutes") {
      ServiceTags.All.map { service =>
        SimpleRestJsonBuilder(service)
          .routes(service.impl(new service.FunctorEndpointCompiler[IO] {
            override def apply[A0, A1, A2, A3, A4](
                fa: service.Endpoint[A0, A1, A2, A3, A4]
            ): Kind5[Kind1[IO]#toKind5]#handler[A0, A1, A2, A3, A4] = _ => IO.stub
          }))
          .make
          .toTry
          .get
      }
    }

  val totalRoute = allRoutes.reduce(_ <+> _)

  def run = EmberServerBuilder.default[IO].withHttpApp(totalRoute.orNotFound).build.useForever
}

Tried this with the following setup:

Before:

Memory usage: heap stabilizes around 186MB, non-heap around 290MB

After #1327:

Memory usage: heap stabilizes around 153MB (strange), non-heap around 333MB

To sum up, the initialization time of the objects (init) increased on average by 21.75%, while time to compile the serices increased by 5.98%.

Memory usage decreased on the heap side by 17.75%, increased on non-heap by 14.83%. I suppose the classes are bigger (and there's more of them) due to all the new bytecode supporting the laziness?

I think the ~21% increase in initialization time is a bit high, but also this is 185 services, so not something typical.

Baccata commented 6 months ago

@kubukoz any chance you could run the same experiment with solution 2 (Hints.lazily) as well ?

Even if the experiment is not representative of a nominal case (due to the sheer number of services), we can reasonably assume that the initialisation time increase is representative of the change. If solution 2 yields (much) better numbers, I'd roll with that.

kubukoz commented 6 months ago

Solution 2 (#1326):

Memory: ...well, hard to talk about this because it went up to ~400MB and then down to ~120, when the first GC cycle hit. Don't know what to make of that... guess I should've ran the previous experiments longer or forced a GC run.

Compared to the status quo, init only gets worse by 7.16%, allRoutes by 1.96%. Clearly beats solution 3 😬

Baccata commented 6 months ago

Alright, let's roll with that one. We can make the diff less prominent by using a by-name extension (Hints.lazily(Hints(....)) => Hints(...).lazily)

kubukoz commented 6 months ago

That's neat, looks like it doesn't make the perf worse either: compared to the "before", in this test round init took only 4.87% longer on average (3050.4ms, stddev 41.03ms), and allRoutes actually got faster by 4.16% (916.8ms, stddev 16.90ms).

I thought these were strange so I re-run the original 0.18.5 setup, it was marginally faster than before (0.5% in init, 3.85% in allRoutes).

Let's stick with .lazily.