com-lihaoyi / upickle

uPickle: a simple, fast, dependency-free JSON & Binary (MessagePack) serialization library for Scala
https://com-lihaoyi.github.io/upickle
MIT License
717 stars 165 forks source link

Writing Scala 3 enums is broken #356

Closed lolgab closed 1 year ago

lolgab commented 3 years ago

Trying to write Scala 3 enum values actually writes always the first one.

Test reproducing the problem

package upickle

import scala.language.implicitConversions
import utest._

import upickle.default._

object EnumTests extends TestSuite {

  enum SimpleEnum {
    case A, B
  }

  val tests = Tests {
    test("simple") {
      given ReadWriter[SimpleEnum] = macroRW[SimpleEnum]
      test("enum write") {
        val parsed = write(SimpleEnum.B)
        val expected = """{"$type":"upickle.EnumTests.SimpleEnum.B"}"""
        assert(parsed == expected)
      }
    }
  }
}

Fails with:

X upickle.EnumTests.simple.enum write 71ms 
  utest.AssertionError: parsed == expected
  parsed: String = {"$type":"upickle.EnumTests.SimpleEnum.A"}
  expected: String = {"$type":"upickle.EnumTests.SimpleEnum.B"}
htmldoug commented 3 years ago

There might be something useful that can be ported over from https://github.com/rallyhealth/weePickle/pull/94.

Lasering commented 2 years ago
import upickle.default.*

enum Color derives Writer:
  case Red, Green, Blue

println(write(Color.Blue))
println(write(Color.Red))
println(write(Color.Green))

Prints {"$type":"Color.Red"} three times.

Lasering commented 2 years ago

The problem is most likely in the macros.summonList. It should be something like:

inline final def summonWriters[T <: Tuple]: List[Writer[_]] =
  inline erasedValue[T] match
    case _: EmptyTuple => Nil
    case _: (t *: ts) => summonWriter[t] :: summonWriters[ts]

inline final def summonWriter[A]: Writer[A] =
  summonFrom {
    case w: Writer[A] => w
    case _: Mirror.Of[A] => Writer.derived[A]
  }
BoopBoopBeepBoop commented 2 years ago

It's my first time in this code so apologies if I have it wrong, but it seems like the root of the problem is that enumeration values are anonymous types, and do not get their own type at runtime - hence, the ClassTag[T] that gets passed through to CaseCaseWriter is actually the class of the enumeration, not the individual case. This results in the annotate call always resolving to the first writer it finds, as all of the specialized writers for the enumeration values are considered (correctly) subtypes of this class tag.

If it's fine to have the default behavior for enumerations be to read/write out the string value, I can take a shot at porting the work @htmldoug referenced above? Which is to generate alternate readers/writers for enumerations using the generated valueOf method.

Lasering commented 2 years ago

Although a ClassTag is required for the macroW it is not used for the Mirror.SumOf case. So that is unlikely to be the problem.

BoopBoopBeepBoop commented 2 years ago

I spent a number of hours debugging this yesterday, and am fairly certain that it is the issue. This is not a problem deriving writers (writers are derived correctly for each enumeration value), it is specifically a problem writing the annotation. The ClassTag is used specifically when making the call to annotate to filter the set of available leaf writers

CLASS NAME: upickle.EnumTests.SimpleEnum.B Classtag: upickle.EnumTests$SimpleEnum

This is from a println inside the derived writer. As you mention, the Classtag is not used there, so it is not a problem there. However, the call to annotate ultimately calls into TaggedWriter.Node/TaggedWriter.Leaf with that information

  object TaggedWriter{
    class Leaf[T](c: ClassTag[_], tag: String, r: CaseW[T]) extends TaggedWriter[T]{
      def findWriter(v: Any) = {
        if (c.runtimeClass.isInstance(v)) tag -> r
        else null
      }
    }
    class Node[T](rs: TaggedWriter[_ <: T]*) extends TaggedWriter[T]{
      def findWriter(v: Any) = scanChildren(rs)(_.findWriter(v)).asInstanceOf[(String, CaseW[T])]
    }
  }

We enter via the findWriter call in Node. The children are a list of Leaf for each children. The _.findWriter call then returns with whatever the first value in the list is, because all of the enumeration values are subtypes of the incorrect classtag.

I am not sure exactly why this is the case, but as mentioned previously, it appears that enumeration values are not considered types and cannot receive a classtag proper - I can invoke implicitly[ClassTag[MyEnum]], but not implicitly[ClassTag[MyEnum.Value1]]. Therefore this method of using subtype filtering based on the ClassTag stored in the derived Writer seems to not be possible for enumerations.

Lasering commented 2 years ago

From my experience implementing this PR ClassTags aren't needed, however uPickle is using them for something, so I'm clearly missing something.

I would expect that for simple enums (where the cases are simple names) the Mirror.ProductOf would never be invoked, and annotate is only called there. The Mirror.SumOf only calls the macros.summonList and the Writer.merge. The summonList is invoked with macros.summonList[Tuple.Map[m.MirroredElemTypes, Writer]] which seems a little magical. The Writer.merge invokes TaggedWriter.Node which does not involve ClassTags, the findWriter there is being invoked on the writers returned by summonList.

BoopBoopBeepBoop commented 2 years ago

Yeah, I don't 100% understand the intent of the tagging code - I was not willing to refactor the way that the annotate calls worked as my first attempt to change the code :sweat_smile: especially given that I haven't worked on crossbuild setups before and it's taxing; I'm finding I have to keep track of many more of the relations in my head rather than via Intellij's type inference. I also had a nagging sense that the ClassTag shouldn't be required in an ideal solution, as the information required for the annotation is technically already present in the writer. Perhaps @lihaoyi could comment on the intent of the TaggedWriter setup?

Have you looked at the PR I posted? Handling enums as being identified by their string value is the way I've seen them used in all codebases I've worked with in practice, usually with custom serializers. Do you think that is a reasonable default?

lihaoyi commented 2 years ago

From what I can tell, TaggedWriter uses a class tag for each TaggedWriter.Leaf node to determine whether it's the correct writer to handle the class or if someone else should do it. The annotate calls construct the leafs, and the merge calls construct the non-leaf nodes. When something comes in to be serialized, it basically walks the tree of TaggedWriter.Node/TaggedWriter.Leafinstances looking for a writer to handle a particular class, and uses that writer to write it.TaggedReaderhas a similar setup, but uses a stringtag` comparison rather than using a class comparison, since that's all we have during deserialization.

The relevant code is here https://github.com/com-lihaoyi/upickle/blob/b8ab7ab21b76fb085497ab5b84f424647bf3e2dd/core/src/upickle/core/Types.scala#L336

If you can get away without using a ClassTag, e.g. by making TaggedWriter use a string tag the same way TaggedReader does, then great.

I can't answer questions about the Scala 3 stuff, as I'm not really familiar with Scala 3 macros or language features. Maybe @anatoliykmetyuk or @jodersky would be able to help more. But in general, I trust our test suite, so if you can get your PR green with addition tests to cover the fixed logic, that's probably about all the assurance I need

BoopBoopBeepBoop commented 2 years ago

Ok thanks @lihaoyi. In terms of general strategy for enumerations - would you prefer that they be serialized as the string representation by default when using macroRW? Or serialized the way that any other case class would be.

To use the example above:

import upickle.default.*

enum Color derives Writer:
  case Red, Green, Blue

println(write(Color.Blue))
println(write(Color.Red))
println(write(Color.Green))

Option 1

"Blue"
"Red"
"Green"

Option 2

{"$type":"Color.Blue"}
{"$type":"Color.Red"}
{"$type":"Color.Green"}

I can see an argument for either; option 1 is how I always see enumerations being dealt with in practice, whereas option 2 allows for extra properties to be included in the serialization by default, which could be handy, but feels less like "canonical" json in my opinion.

Also with option 2, if enumeration values change between versions of 2 pieces of software, I'm not sure how deserializing it should work - a situation like {"$type":"Color.Red", "ordinal": 0} deserializing to an enumeration value that is Color.Red(ordinal = 1) seems strange. But we wouldn't be able to set those values for enumeration instances, so you'd have to "lie" during deserialization I think and locate the correct instance using valueOf anyways?


Separate question, are there contributor guidelines that cover how to add to the documentation for the site? The enumeration handling should be added as part of the PR

lihaoyi commented 2 years ago

@BoopBoopBeepBoop For now let's go with Option 2.

If we go with Option 1 for enums, we should do the same for case objects, but that would be a significant breaking change. That's not to say we shouldn't do it, but we shouldn't do it in this PR. Since Enums are basically case-class/case-object hierarchies, we should just follow suite for now.

Maybe we can add another flag or annotation to allow the "short" enum serialization format, but let's do that in a follow up if desired

BoopBoopBeepBoop commented 2 years ago

Ah, I hadn't considered case objects. Or for something even worse, a sealed trait whose subtypes include enumerations mixed in. Ew. I'll see if I can find a nice way to distinguish between deriving the default ReadWriter and an enum-specific one.

I think case objects are broken for scala 3 right now as well, btw.

sealed trait Animal derives ReadWriter

case class Person(name: String, address: String, age: Int = 20)
  extends Animal
  derives ReadWriter

case class Cat(name: String, owner: Person)
  extends Animal
  derives ReadWriter

case class Dog(name: String, age: Int)
  derives ReadWriter

case object Cthulu
  extends Animal
  derives ReadWriter

From the scala 3 tests - the addition of Cthulu causes macro expansion failures. I'll put up a separate issue/PR for that.

JD557 commented 2 years ago

Looks like ADTs encoded using enums also fail.

Using the Color example in https://docs.scala-lang.org/scala3/reference/enums/adts.html

import upickle.default._

enum Color(val rgb: Int):
  case Red   extends Color(0xFF0000)
  case Green extends Color(0x00FF00)
  case Blue  extends Color(0x0000FF)
  case Mix(mix: Int) extends Color(mix)

object Color {
  implicit val rw: ReadWriter[Color] = macroRW
}

Fails with

Exception occurred while executing macro expansion.
java.lang.Exception: Enumeration valueOf method not found
    at upickle.implicits.macros.macros$package$.$anonfun$8(macros.scala:104)
    at scala.Option.getOrElse(Option.scala:201)
    at upickle.implicits.macros.macros$package$.enumValueOfImpl(macros.scala:105)