consolidation / output-formatters

Apply transformations to structured data to write output in different formats.
Other
192 stars 13 forks source link

Field with any newline character is truncated #95

Open Adnan-cds opened 3 years ago

Adnan-cds commented 3 years ago

Hello, I was trying to create a CSV file where one of the fields lists multiple URLs. For peoples' viewing pleasure, I thought I will separate those URLs using "\n". Kind of like this:

"Source page node id", "Broken Links in this page"
42, "https://example.net/foo
https://example.net/bar
https://example.net/bar"
162, "https://example.net/qux"

But the output from drush insert-command-here --format=csv kept producing a truncated version like the following:

"Source page node id", "Broken Links in this page"
42, "https://example.net/foo
162, "https://example.net/qux"

Turned out the CSV formatter is keeping the first line of each CSV row only:

https://github.com/consolidation/output-formatters/blob/fb12960888a53b4767e87eee9847cf6c54848e11/src/Formatters/CsvFormatter.php#L127

Fix is easy (patch pasted below). I am happy to raise a pull request, but I am wondering if the current behaviour is intentional or not. Please advice.

--- /path/to/drupal/vendor/consolidation/output-formatters/src/Formatters/CsvFormatter.php     2020-10-11 05:15:32.000000000 +0100
+++ CsvFormatter.php    2021-03-12 20:03:39.327122512 +0000
@@ -123,9 +123,11 @@
         } else {
             fputcsv($buffer, $data, $delimiter, $enclosure);
         }
+        $buffer_length = ftell($buffer);
         rewind($buffer);
-        $csv = fgets($buffer);
+        $csv = fread($buffer, $buffer_length);
         fclose($buffer);
+
         return $csv;
     }
 }
greg-1-anderson commented 3 years ago

The current implementation of the csv formatter was not intended to support fields with embedded line breaks; however, I would not be opposed to a pull request to allow for this capability, provided that it included at least one test.

donquixote commented 1 year ago

From previous experience with csv in php: The php csv implementation is a bit silly in it using an "escape char". The more common format does not use an escape char, instead you would use " as the enclosure, and then double "" is a literal ". This is more sane, more common, and will be understood by most programs. I think it is not a real standard but an RFC that most people follow, except php.

This would do the trick, for comma as delimiter - but obviously you want to support different values for the parameters.

    protected function csvEscape($data, $delimiter = ',', $enclosure = '"', $escapeChar = "\\")
    {
        return implode(',', array_map(static function (string $value): string {
          if (!preg_match('@["\n,]@', $value)) {
            return $value;
          }
          return '"' . str_replace('"', '""', $value) . '"';
        }, $data)) . "\n";
      }

EDIT: A "\n" was missing.

greg-1-anderson commented 1 year ago

The definition of the csv data type in the output formatters project is that each record appears on a separate line; therefore, if embedded newlines are to be supported by this format, they would need to be converted to some form of escape character.

Your program could install a custom formatter, or replace the csv formatter with a custom implementation that behaves the way that you want, but we can't add embedded newlines to the csv formatter provided here.

donquixote commented 1 year ago

The definition of the csv data type in the output formatters project is that each record appears on a separate line;

May I ask why this definition exists in this way and is such a hard requirement? Is this mainly for BC, or is there a deeper reason?

If BC is a concern, we can look for ways around that.

therefore, if embedded newlines are to be supported by this format, they would need to be converted to some form of escape character.

This would not be understood by typical programs that can read csv, e.g. spreadsheet applications.

greg-1-anderson commented 1 year ago

Yes, it would require a major version to change the behavior. As for ways around it, perhaps a property could stipulate behavior, but it's already possible to install your own formatter, or replace a standard one if you'd prefer different behavior in your application.

donquixote commented 1 year ago

Would you accept an alternative csv formatter as part of this package? How would we name it? csv-rfc-4180? Or csv-multiline? https://www.rfc-editor.org/rfc/rfc4180 https://www.ietf.org/rfc/rfc4180.txt (same thing)

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

https://stackoverflow.com/questions/10140999/csv-with-comma-or-semicolon

Unfortunately, RFC 4180 is a request for comments, NOT a standard. It says right at the top -- "does not specify an Internet standard of any kind." Later, it says that RFC 4180 "documents the format that seems to be followed by most implementations."

(this is in one of the comments)

So, not a real standard, but still better than what php produces with native fputcsv().

I think the remaining problem with office applications is character encoding, which is independent from the escape char and enclosure syntax.

donquixote commented 1 year ago

it's already possible to install your own formatter

Actually, how? I see FormatterManager->addFormatter(), and I see how drush uses this to add a new HelpCLIFormatter(). But it does that within a specific command only ("help").

For context: Atm I am working with openeuropa/task-runner. Other times I work with drush. Atm I would want the formatter to be globally available in a website project. Other times I might want to ship the formatter with a shared package or module.

donquixote commented 1 year ago

Actually, the inconsistent handling in php has been fixed, it seems :)

In the docs for all the csv functions we find this:

escape The optional escape parameter sets the escape character (at most one single-byte character). An empty string ("") disables the proprietary escape mechanism.