VirtusLab / scala-yaml

https://virtuslab.github.io/scala-yaml/
Apache License 2.0

Parsing of escaped characters in quoted strings and backslashes in unquoted strings #335

Open OndrejSpanel opened 2 months ago

OndrejSpanel commented 2 months ago

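In the following code, escape sequences inside double-quoted strings are not interpreted. "\\b" is parsed as two backslashes followed by b, and "\b" as a backslash followed by b. I would expect a single backslash followed by b in the first case and a backspace character in the second: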

import org.virtuslab.yaml.*

val yaml = """
regexBoundary: "\\b"
backspace: "\b"
regexBoundaryUnquoted: \b
"""

case class Example(
    regexBoundary: String,
    backspace: String,
    regexBoundaryUnquoted: String,
) derives YamlCodec

val example = yaml.as[Example].toOption.get

println(example.regexBoundary)
println(example.backspace)
println(example.regexBoundaryUnquoted)

See also https://scastie.scala-lang.org/OndrejSpanel/jfavH3u2Sq29Gs5nQHmXKQ/92

The output is:

\\b
\b
\\b

None of this is correct. There should be no double backslashes and there should be a backspace, not \b in the second line.

Note: even the unquoted string is parsed incorrectly. The single backslash is converted to a double backslash in the case class value.

lbialy commented 2 months ago

First of all, there's some interference from triplequote here:

scala> "\\b"
val res0: String = \b

scala> """\\b"""
val res1: String = \\b

scala> val yaml = """
     | regexBoundary: "\\b"
     | backspace: "\b"
     | regexBoundaryUnquoted: \b
     | """
val yaml: String = "
regexBoundary: "\\b"
backspace: "\b"
regexBoundaryUnquoted: \b
"

This can be fixed with:

scala> val yaml = s"""
     | regexBoundary: "${"\\b"}"
     | backspace: "${"\b"}"
     | regexBoundaryUnquoted: ${"\b"}
     | """
val yaml: String = "
regexBoundary: "\b"
backspace: "
regexBoundaryUnquoted:
"

When we try to parse that we get an error:

scala> val example = yaml.as[Example]
val example: Either[org.virtuslab.yaml.YamlError, Example] = Left(org.virtuslab.yaml.ConstructError: Could't construct java.lang.String from null (tag:yaml.org,2002:null)
regexBoundaryUnquoted:
                       ^
)

This is probably a bug, since the regexBoundaryUnquoted mapping does have a value: the backspace character produced by \b. If we drop the unquoted field we get:

scala> case class Example(
     |     regexBoundary: String,
     |     backspace: String,
     | ) derives YamlCodec
// defined case class Example

scala> val example = yaml.as[Example].right.get
val example: Example = Example(\b)

scala> example.regexBoundary
val res0: String = \b

scala> example.backspace
val res1: String =

scala> YamlEncoder.escapeSpecialCharacters(example.backspace)
val res2: String = \u0008

Which is what you'd expect, I guess. I think there are issues around escaping for Scalar nodes due to this code in ScalarStyle.scala:

sealed abstract class ScalarStyle(indicator: Char)
object ScalarStyle {
  case object Plain        extends ScalarStyle(' ')
  case object DoubleQuoted extends ScalarStyle('"')
  case object SingleQuoted extends ScalarStyle('\'')
  case object Folded       extends ScalarStyle('>')
  case object Literal      extends ScalarStyle('|')

  def escapeSpecialCharacter(scalar: String, scalarStyle: ScalarStyle): String =
    scalarStyle match {
      case ScalarStyle.DoubleQuoted => scalar
      case ScalarStyle.SingleQuoted => scalar
      case ScalarStyle.Literal      => scalar
      case _ =>
        scalar.flatMap { char =>
          char match {
            case '\\'  => "\\\\"
            case '\n'  => "\\n"
            case other => other.toString
          }
        }
    }
}

Those escapes are limited to unquoted strings, which does make sense to be honest, but I'm not sure they are 100% correct, as they were here before I started maintaining the lib.
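To see what the fallback branch does to a backslash, here is a minimal standalone sketch. It is a trimmed copy of the excerpt above (only two styles, renamed top-level definitions), not the library's actual code path, and whether this function is what produces the doubled backslash in the parsed value is still only a hypothesis.

```scala
// Standalone copy of the excerpt, reduced to two styles for illustration.
sealed trait ScalarStyle
case object Plain        extends ScalarStyle
case object DoubleQuoted extends ScalarStyle

def escapeSpecialCharacter(scalar: String, style: ScalarStyle): String =
  style match {
    case DoubleQuoted => scalar // returned unchanged
    case Plain =>
      scalar.flatMap { char =>
        char match {
          case '\\'  => "\\\\" // one backslash becomes two
          case '\n'  => "\\n"  // newline becomes backslash + n
          case other => other.toString
        }
      }
  }

// Two characters: a backslash followed by b.
val input = """\b"""

println(escapeSpecialCharacter(input, Plain))        // prints \\b
println(escapeSpecialCharacter(input, DoubleQuoted)) // prints \b
```

The Plain result has three characters, which matches the doubled backslash reported for the unquoted field.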

OndrejSpanel commented 2 months ago

> interference from triplequote

Triple quotes prevent backslashes from being used as escapes; they are treated as literal characters instead. This is expected, as triple quotes define raw string literals.
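The difference can be checked in plain Scala without any YAML library involved. The sketch below only uses standard string literals; the value names are made up for illustration.

```scala
// Regular literals: the Scala compiler processes the escapes.
val regular   = "\\b" // backslash, b
val backspace = "\b"  // a single backspace character (U+0008)

// Triple-quoted literals: backslashes are kept exactly as written.
val rawDouble = """\\b""" // backslash, backslash, b
val rawSingle = """\b"""  // backslash, b

// Interpolating a regular literal into a triple-quoted string inserts
// the already-processed character, here a real backspace.
val interpolated = s"""value: ${"\b"}"""

println(regular.length)            // 2
println(backspace.length)          // 1
println(rawDouble.length)          // 3
println(rawSingle.length)          // 2
println(interpolated.last == '\b') // true
```

So a triple-quoted YAML document carries the same characters a file on disk would, while the interpolated variant carries a control character instead of an escape sequence.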

lbialy commented 2 months ago

yeah, but it's not a problem with yaml parser, you get what you see

lbialy commented 2 months ago

I think \b does get borked in the escaping of unquoted strings:

scala> val yaml = s"""
     | regexBoundary: "${"\\b"}"
     | backspace: "${"\b"}"
     | regexBoundaryUnquoted: some${"\b"}text${"\b"}
     | """
val yaml: String = "
regexBoundary: "\b"
backspace: "
regexBoundaryUnquoted: somtext
"

Notice somtext here: the terminal correctly renders the backspace control character, which erases the preceding e.

scala> YamlEncoder.escapeSpecialCharacters(yaml)
val res6: String = "
regexBoundary: "\b"
backspace: "\u0008"
regexBoundaryUnquoted: some\u0008text\u0008
"

scala> val example = yaml.as[Example].right.get
val example: Example = Example(\b,somtext)

scala> YamlEncoder.escapeSpecialCharacters(example.regexBoundaryUnquoted)
val res8: String = some\u0008text

Notice the missing \u0008 after text here: the trailing backspace got trimmed. I have to go through the spec on parsing to understand what the correct (or rather, spec-compliant) behavior is here.

OndrejSpanel commented 2 months ago

> yeah, but it's not a problem with yaml parser, you get what you see

I am not sure I understand. Compare this with Circe behaviour in https://scastie.scala-lang.org/OndrejSpanel/cz1HDKc7RoaOgEnrg6YcQA/5.

Instead of triple quotes I could use an input from a file. When there is a backslash in the quoted string, it should be processed as an escape by the YAML parser. When it is in an unquoted input, it should be processed as a backslash character.

What I see instead is that it is processed as a literal backslash character in a quoted string and as two backslashes in an unquoted string.

OndrejSpanel commented 2 months ago

Note: it is not my intention to have backspace characters present in my input. I want the \b escape sequence to be present there, which is exactly what triple quotes allow me to do. You can imagine you are reading the input from a file instead. The code using interpolation places a backspace character into the input, which is not what I am interested in, and I have no idea how such a thing should be handled by a parser.

OndrejSpanel commented 2 months ago

From the spec: https://yaml.org/spec/1.2.2/#57-escaped-characters

> Note that escape sequences are only interpreted in double-quoted scalars. In all other scalar styles, the “\” character has no special meaning and non-printable characters are not available.
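As a rough illustration of that rule, here is a minimal sketch of what I would expect for the values in this issue. The helper names unescapeDoubleQuoted and scalarValue are hypothetical and not part of scala-yaml, and only a handful of the escapes listed in the spec are handled.

```scala
// Hypothetical helper: interprets a small subset of YAML escapes in the
// body of a double-quoted scalar (surrounding quotes already removed).
def unescapeDoubleQuoted(body: String): String = {
  val out = new StringBuilder
  var i = 0
  while (i < body.length) {
    val c = body.charAt(i)
    if (c == '\\' && i + 1 < body.length) {
      body.charAt(i + 1) match {
        case '\\'  => out.append('\\')
        case '"'   => out.append('"')
        case 'b'   => out.append('\b')
        case 'n'   => out.append('\n')
        case 't'   => out.append('\t')
        // Sketch only: other escapes from the spec are left untouched here.
        case other => out.append('\\').append(other)
      }
      i += 2
    } else {
      out.append(c)
      i += 1
    }
  }
  out.toString
}

// Hypothetical helper: in every other style a backslash is an ordinary character.
def scalarValue(text: String, doubleQuoted: Boolean): String =
  if (doubleQuoted) unescapeDoubleQuoted(text) else text

println(scalarValue("""\\b""", doubleQuoted = true))         // prints \b
println(scalarValue("""\b""", doubleQuoted = true) == "\b")  // true
println(scalarValue("""\b""", doubleQuoted = false))         // prints \b
```

Under this reading, regexBoundary and regexBoundaryUnquoted would both come out as a backslash followed by b, and backspace as a single control character.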

lbialy commented 2 months ago

Ahhh, I misunderstood your intent. Ok, I get it now.

OndrejSpanel commented 2 months ago

Another related, but perhaps simpler, example: at the moment I cannot find a way to represent a backslash character in my input. Using \ in unquoted strings results in a double backslash. Using a double backslash in quoted strings results in a crash or strange behaviour:

Check:

import org.virtuslab.yaml.*

val yaml = """value: \"""
case class Example(value: String) derives YamlCodec
yaml.as[Example].toOption.get

Or even worse:

import org.virtuslab.yaml.*

val yaml = """
quoted: "\\"
unquoted: \
"""

case class Example(quoted: String) derives YamlCodec

yaml.as[Example].toOption.get

Which results in the strange:

Example(\" unquoted: \ )