Block string literals

aljazerzen commented 11 months ago

This is a follow-up to https://github.com/PRQL/prql/pull/3679#issuecomment-1763491707

I propose we add "block string literals" which are:

syntax for a string literal,
it starts with one " followed by a new line,
it can then contain repeated lines of:
- zero or more white-space characters (which are discarded),
- one " character (also discarded),
- any characters other than a new line (these are the string literal contents),
- a new line (also included in the contents)
it ends at the first line that does not start with optional white-space followed by a " character
it supports escaped characters

Example:

let a = "
    "this is the first line
    "  second line, indented with two spaces
    "  third line, also indented with two spaces

... equivalent to:

let a = "
    "this is the first line
    "  second line, indented with two spaces
  "  third line, also indented with two spaces

... equivalent to:

let a = """this is the first line
  second line, indented with two spaces
  third line, also indented with two spaces
"""
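
The rules above can be sketched in a few lines of Python. This is only an illustration of the proposal, not code from the PRQL lexer: the function name `read_block_string` is hypothetical, the input is assumed to be the lines that follow the opening quote, and escape sequences are deliberately left unprocessed.

```python
def read_block_string(lines):
    """Return (contents, lines_consumed) for the lines after the opening quote.

    Hypothetical helper: `lines` holds source lines without their newline
    terminators. Escape sequences are not processed in this sketch.
    """
    parts = []
    consumed = 0
    for line in lines:
        stripped = line.lstrip()  # leading white-space is discarded
        if not stripped.startswith('"'):
            break  # the first line without a leading quote ends the literal
        # Drop the quote, keep the rest, and include the line's newline.
        parts.append(stripped[1:] + "\n")
        consumed += 1
    return "".join(parts), consumed


source = '''let a = "
    "this is the first line
    "  second line, indented with two spaces
  "  third line, also indented with two spaces

from x
'''

# Skip the `let a = "` line; the block string body starts on the next line.
contents, consumed = read_block_string(source.splitlines()[1:])
print(repr(contents))
```

Each line is classified on its own, which is the property the first "pro" below relies on: deciding whether a line belongs to the literal never requires looking at any other line.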

Pros:

lexing of one line is (mostly) independent of lexing of the other lines. This simplifies the lexer.
multi-line text blocks can be indented,
text blocks that start with a non-new-line char and end with a new-line look nice

Cons:

pasting a large number of lines would require you to prefix each one with a "
a bit unconventional

Inspired by zig: https://ziglang.org/documentation/master/#String-Literals-and-Unicode-Code-Point-Literals.

The syntax that I propose differs in

usage of " instead of \\ and
the leading " on the first line, which is needed because our newlines have semantic meaning.

Possible extensions:

r-string variant (or even f-string and s-string)

richb-hanover commented 11 months ago

Another pro: all rows of a multi-line string line up - both the first and all following lines - in an obvious way. See the first example, and compare to the final...
Another con: I might like this so much that it will make me sad to type multi-line strings in other languages :-)

max-sixty commented 11 months ago

I agree with the pros & cons.

I do think it looks a bit alien, and in a different way than the zig version. Possibly that's because in most languages this is a common invalid expression — i.e. an unfinished string.

(only an aesthetic concern, so let's not weigh highly)

Given it's a very new construction and doesn't block anything, this is something I think is worth leaving open for a bit, for us to contemplate and aggregate views. (But we also shouldn't just leave it forever, let's make a decision in a couple of weeks?)

snth commented 11 months ago

As mentioned in #3679, I'm in favour of d-strings rather, probably without the d prefix so that it can apply to s- and f-strings too.

To wit:

This is rough idea about d-string:

d-string starts with triple quote (""" or ''') followed by newline. The newline is not included in the string.

d"spam" and d'''egg''' are syntax error.

d-string ends with indent and triple quote.

Only the indent to be removed is allowed before closing triple quote in the line.

Indents in lines that are same to the last line is stripped.

d-string can be used with 'f/F' and 'r/R' prefix.

As for the delimiter, you could use d"""\n...\n""" or """\n...\n""" or even just "\n...\n". I would prefer some sort of triple-quote because that is the convention but it would not be necessary. The neat thing about this is that you can change the level of indentation of the whole block by just changing the amount of indentation of the closing delimiter.

I don't understand what would be the benefit of having to start each line with "? I guess it probably comes down to

lexing of one line is (mostly) independent of lexing of the other lines. This simplifies the lexer.

but weighing that up against

pasting a large number of lines would require you to prefix each one with a "

, the lexer is written once by a handful of people whereas pasting a large number of lines will probably happen multiple times a day (/week/month) by thousands of people. I know I'm not one of the lexer writers so it's a bit unfair for me to say but I do feel the leading quotes would be a big usability impairment. Maybe your text editor can automate it but then what about the Playground, or once we fix that then DBeaver etc... ?

Examples

let a = "
    "this is the first line
    "  second line, indented with two spaces
  "  third line, also indented with two spaces

I find this very hard to parse by human eye and to me it is the opposite of significant whitespace.

I would much rather see

let a = "
this is the first line
  second line, indented with two spaces
  third line, also indented with two spaces
"

or better

let a = """
this is the first line
  second line, indented with two spaces
  third line, also indented with two spaces
"""

equivalent to (everything lines up above the closing delimiter)

let a = """
    this is the first line
      second line, indented with two spaces
      third line, also indented with two spaces
    """
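
The dedent rule quoted above (strip from every line the indentation that precedes the closing delimiter) can be sketched as follows. This is a hedged illustration, not an agreed design: the name `dedent_by_closing` is hypothetical, the input is assumed to be the raw text between the opening delimiter's newline and the closing delimiter, and allowing blank lines and keeping the final newline are assumptions the quoted description does not settle.

```python
def dedent_by_closing(raw: str) -> str:
    """Strip the closing delimiter's indentation from every line of `raw`.

    Hypothetical helper: `raw` is everything after the opening delimiter's
    newline, up to (not including) the closing delimiter.
    """
    *lines, indent = raw.split("\n")
    if indent.strip():
        raise ValueError("only indentation may precede the closing delimiter")
    out = []
    for line in lines:
        if line.startswith(indent):
            out.append(line[len(indent):])
        elif not line.strip():
            out.append("")  # assumption: blank lines need not carry the indent
        else:
            raise ValueError("line is indented less than the closing delimiter")
    return "".join(line + "\n" for line in out)


flush_left = (
    "this is the first line\n"
    "  second line, indented with two spaces\n"
    "  third line, also indented with two spaces\n"
)
indented = (
    "    this is the first line\n"
    "      second line, indented with two spaces\n"
    "      third line, also indented with two spaces\n"
    "    "
)
print(dedent_by_closing(flush_left) == dedent_by_closing(indented))
```

Moving the closing delimiter left or right changes how much indentation is stripped, which is the "change the indentation of the whole block" property described above.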

Also, if you wanted to quote dialog from a play or say a chatbot interaction?

let quotes = """
"To PRQL, or not to PRQL: that is the query."
"If PRQL be the language of data, query on."
"Friends, SQL users, data enthusiasts, lend me your queries."
"""

Pretty niche use case but with the increasing use of LLMs, perhaps the amount of leading quotes will not be insignificant?

aljazerzen commented 11 months ago

Let's discuss d-strings in a separate thread, with a full description of how they would work in PRQL.

the lexer is written once by a handful of people whereas pasting a large number of lines will probably happen multiple times a day (/week/month) by thousands of people. I know I'm not one of the lexer writers so it's a bit unfair for me to say but I do feel the leading quotes would be a big usability impairment. Maybe your text editor can automate it but then what about the Playground, or once we fix that then DBeaver etc... ?

It's not about code complexity, but about the quality of compiler and language server error recovery. In short, think about what happens when you have an unclosed quote: all characters up to the next quote or the end of the file become the contents of the string. This prevents the compiler from reporting any other error messages, forces the language server to discard any precompiled information about the file, and in rare cases makes it report an incorrect error location:

from x
derive a = "hello
derive b = "world"
                 \___ unclosed string at line 3

For more info, read On Modularity of Lexical Analysis, which has been linked to a bit too much.
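
The error-location problem above can be reproduced with two toy scanners. This is a minimal sketch, not PRQL's lexer: both function names are hypothetical, and escapes and triple quotes are ignored. The first scanner lets a string run across lines, the second treats every line independently.

```python
def unclosed_multiline(source: str) -> list[int]:
    """Strings may span lines: report the line of a quote left open at EOF."""
    line = 1
    open_line = None  # line where the currently open string started
    for ch in source:
        if ch == '"':
            # A quote either opens a string or closes the open one.
            open_line = line if open_line is None else None
        elif ch == "\n":
            line += 1
    return [] if open_line is None else [open_line]


def unclosed_per_line(source: str) -> list[int]:
    """Strings end at the line break: each line is checked on its own."""
    errors = []
    for number, text in enumerate(source.splitlines(), start=1):
        if text.count('"') % 2 == 1:  # odd number of quotes on this line
            errors.append(number)
    return errors


example = 'from x\nderive a = "hello\nderive b = "world"\n'

print(unclosed_multiline(example))  # blames the line after the mistake
print(unclosed_per_line(example))   # blames the line with the mistake
```

With multi-line strings the missing quote on line 2 is silently paired with the first quote on line 3, so the error surfaces one line late. The per-line scanner reports line 2 and can keep lexing line 3 normally.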

whereas pasting a large number of lines will probably happen multiple times a day (/week/month) by thousands of people

This is a valid argument.

I'd say that this:

let a = "
    "this is the first line
    "  second line, indented with two spaces
    "  third line, also indented with two spaces

... looks just as nice as this:

let a = """
    this is the first line
      second line, indented with two spaces
      third line, also indented with two spaces
    """

Under my proposal above, this is valid:

let quotes = "
  ""To PRQL, or not to PRQL: that is the query."
  ""If PRQL be the language of data, query on."
  ""Friends, SQL users, data enthusiasts, lend me your queries."

kgutwin commented 10 months ago

I had one thought... I may be mistaken, but it seems like the most common use of multi-line strings in PRQL today would be from_text (at least, that's where I see them most used in the examples). Imagining for a moment that users wanted to use from_text in any sort of scripted capacity (again, at least, that's how I plan to use it in the near term), the suggestion to have " as a per-line prefix would make this use case quite painful. As an example, suppose the user was building a PRQL script that was templated with Jinja; currently this could be very straightforward:

from_text """
{{ source_csv_data }}
"""
derive {
    d = b + c,
    answer = 20 * 2 + 2,
}

You could also imagine a shell pipeline equivalent:

(
    echo 'from_text """'
    cat "$INPUT_CSV"
    echo '"""'
    echo 'derive { d = b + c }'
) | prqlc compile > script.sql

Yes, of course it's possible to do the necessary "quote prefix" operation in both of these cases; my only thought is I would expect that most users would find needing to do so frustrating. And while you could also emphasize that loading data using from_text is not preferred, it's precisely those cases where from_text is most useful (small translation tables that don't already exist in the database but are in a little CSV file) that would be most affected by this proposal.
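
For reference, the "quote prefix" operation mentioned above might look like the sketch below. It targets the syntax proposed at the top of this thread, which is not implemented anywhere; the name `to_block_string` is hypothetical, and doubling backslashes is an assumption based on the proposal saying escapes are supported.

```python
def to_block_string(text: str, indent: str = "    ") -> str:
    """Render `text` as a block string literal in the proposed syntax.

    Hypothetical helper. Every content line gains a trailing newline, so
    input that lacks a final newline will come back with one.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the final newline is already implied by the last line
    # Assumption: backslashes must be doubled because escapes are supported.
    escaped = (line.replace("\\", "\\\\") for line in lines)
    return '"\n' + "".join(f'{indent}"{line}\n' for line in escaped)


csv_data = "a,b,c\n1,2,3\n4,5,6\n"
script = "from_text " + to_block_string(csv_data) + "derive { d = b + c }\n"
print(script)
```

A template or shell pipeline would need an equivalent step wherever data is spliced in, which is the extra work this comment is concerned about.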

aljazerzen commented 10 months ago

Good point.

I want to say that ideally, people would not be using a templating engine over PRQL, as that means that PRQL lacks constructs for what you are trying to do with the template. It also makes you prone to injection attacks, but nevertheless, people will want to do templates, and such string literals would be quite an inconvenient blocker for that.

vanillajonathan commented 10 months ago

Do we need this?

I think we should avoid introducing things that people are not familiar with and syntax that looks foreign to users.

If we introduce something like this, I think it would be preferable to go with the d-string because it looks just like an ordinary string prefixed with a d.

aljazerzen commented 10 months ago

The original motivation was to remove other multi-line-strings, as they cause problems for tooling development and performance in the future.

vanillajonathan commented 10 months ago

Yeah, maybe we should remove the multi-line strings. If I understand correctly, they're a bit unorthodox: they have the syntax of normal strings but can span multiple lines, and we also have the triple double-quoted strings, which I think do the same thing.

Maybe we should just keep the triple double-quoted string as is, and change the double-quoted strings to not span multiple lines.

PRQL / prql

Block string literals #3783
