disneystreaming / smithy4s

https://disneystreaming.github.io/smithy4s/
Other
340 stars 68 forks source link

Default value of `smithy4sOutputDir` can break codegen cache #1495

Closed majk-p closed 1 month ago

majk-p commented 2 months ago

TL;DR

Can we please change:

config / smithy4sOutputDir := (config / sourceManaged).value / "scala"

to something like:

config / smithy4sOutputDir := (config / sourceManaged).value / "scala" / "smithy4s"

Or exclude codegenArgs.output from cache altogether?

Problem

I'm working on an sbt plugin that uses smithy4s, but also generates some of it's own managed sources. Smithy4s produces files to smithy4sOutputDir which by default is defined as smithy4sOutputDir := (config / sourceManaged).value / "scala". This value is then later used by Smithy4sCodegenPlugin as a part of codegenArgs used to produce files. The entire CodegenArgs object is used as a cache key and thus the output counts in. That wouldn't be a problem because the path would just remain cached as a String, but smithy4s also overrides the json format for os.Path in a way that calculates the hash of the entire directory content. It isn't a problem as long as smithy4s is the only source generator. When another plugin adds managed sources the hashes change and we stop hitting cache. In some cases this might lead to repeated very long compilation times.

Perhaps it's worth reconsidering if smithy4s codegen cache should rely on the contents of source_managed and resource_managed at all, since it seems easy to invalidate the cache.

It's also not obvious to the end user why the cache got invalidated. In my debugging case I would be very happy if the plugin told me why it decided to rerun codegen even though smithy sources remained unchanged.

Solution

I think adding another segment in the path should solve this and save fellow sbt plugin authors some headache. This way other plugins can either use the root sourceManaged / scala directory or anything different than sourceManaged / scala / smithy4s.

The other option is to modify JsonConverters.codegenArgsIso in a way that it doesn't serialize output.

kubukoz commented 2 months ago

I'm thinking to do both:

Baccata commented 2 months ago

I agree with Jakub.

majk-p commented 2 months ago

Sounds good, I can raise a PR to address this

majk-p commented 2 months ago

Working on my initial problem I've noticed an additional issue around pathFormat

implicit val pathFormat: JsonFormat[os.Path] =
  BasicJsonProtocol.projectFormat[os.Path, HashFileInfo](
    p => {
      if (os.isFile(p)) FileInfo.hash(p.toIO)
      else
        // If the path is a directory, we get the hashes of all files
        // then hash the concatenation of the hash's bytes.
        FileInfo.hash(
          p.toIO,
          Hash(
            os.walk(p)
              .map(_.toIO)
              .map(Hash(_))
              .foldLeft(Array.emptyByteArray)(_ ++ _)
          )
        )
    },
    hash => os.Path(hash.file)
  )

It assumes all paths will exist, which is not always the case. In one of my sub-modules I noticed the resource_managed was not present, which silently causes the cache miss and code regeneration. Since we're about to exclude hashes of the outputs (both sources and resources) the issue should disappear for now, but this may require some more thought in future.

kubukoz commented 2 months ago

The paths are checked for existence first:

val inputFiles =
  (inputDirs ++ generatedFiles)
    .filter(_.exists())
    .toList

but only for the inputs 😅