eBay / tsv-utils

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
https://ebay.github.io/tsv-utils/
Boost Software License 1.0

Inconsistent newline handling on Windows #310

Closed jondegenhardt closed 3 years ago

jondegenhardt commented 3 years ago

This is a follow-on to issue #307, which identified issues with newline handling on Windows.

Note: Windows is not a currently supported platform. However, it is certainly desirable to have it run on Windows, making it worthwhile to identify issues needing to be addressed on Windows.

The general problem is that the read and write mechanisms used by the various tsv-utils tools do not handle newlines consistently on Windows. In some cases files are read in binary mode, in other cases in text mode. The same is true for writing. (Text mode engages newline translation on Windows in the low-level I/O routines.) At present the exact behavior of each tool is not known; this requires further evaluation.
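To make the binary versus text mode distinction concrete, here is a minimal Python sketch. It is an illustration only and not tsv-utils code (tsv-utils is written in D). The helper name `write_and_inspect` is hypothetical, and passing `newline="\r\n"` is an assumption used to simulate Windows text-mode output on any platform.

```python
import os
import tempfile

def write_and_inspect(newline):
    """Write two TSV lines in text mode using the given newline setting,
    then return the raw bytes that actually reached the file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", newline=newline) as f:
            f.write("a\tb\n1\t2\n")
        with open(path, "rb") as f:  # binary mode: bytes are read untouched
            return f.read()
    finally:
        os.remove(path)

# newline="\r\n" translates each "\n" on output, as Windows text mode would.
translated = write_and_inspect("\r\n")
# newline="" disables translation, which matches binary-mode output.
untranslated = write_and_inspect("")

print(translated)
print(untranslated)
```

The same program logic produces different bytes on disk depending only on the mode, which is why tools that mix the two modes can behave inconsistently.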

There are a couple of things that need to be done. One is to have a complete test suite. The other is to choose newline policies. These policies would be tested in the test suite.

Some possible newline handling policies:

  1. Read and write Unix newlines only, on all platforms.
  2. Read and write using platform preferred newlines.
  3. Read either Windows or Unix newlines; write Unix newlines.
  4. Full customization via command line arguments.

The simplest policy to support would be to restrict the tools to Unix newlines: require Unix newlines on input and write only Unix newlines on output. The existing test suite would largely support this. It would also be the correct choice in many environments, especially where a mix of Unix and Windows platforms is in use. That is, if data files are being shared, Unix newlines will normally be preferred.
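A Unix-only policy implies rejecting input that uses Windows newlines rather than silently carrying a trailing carriage return into the last field. The sketch below shows one way that check could look. It is written in Python for illustration and does not reflect how tsv-utils implements it; the function name `read_unix_lines` and the error message are hypothetical.

```python
import io

def read_unix_lines(stream):
    """Yield each line of a binary stream without its trailing b"\n".

    Raises ValueError on the first line terminated by CRLF, since this
    policy accepts Unix newlines only.
    """
    for line_num, line in enumerate(stream, start=1):
        if line.endswith(b"\r\n"):
            raise ValueError(
                f"line {line_num}: Windows newline (CRLF) found; "
                "Unix newlines are required"
            )
        # Binary streams split on b"\n" only, so at most one is present.
        yield line[:-1] if line.endswith(b"\n") else line

# Unix input passes through unchanged.
unix_lines = list(read_unix_lines(io.BytesIO(b"a\tb\n1\t2\n")))

# Windows input is rejected with a line number.
try:
    list(read_unix_lines(io.BytesIO(b"a\tb\n1\t2\r\n")))
    rejected = False
except ValueError:
    rejected = True
```

Reading in binary mode matters here: in text mode the platform may strip the carriage return before the check ever sees it.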

Reading either form (option 3) might be easier than expected, as most tools use bufferedByLine. In particular, bufferedByLine.front handles newlines. However, a number of tools have their own reader functionality, so it would still be necessary to have a test suite for each tool.
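As a rough illustration of option 3, the following Python sketch accepts either newline form on input and always writes Unix newlines. It is a stand-in for the idea only and is not based on the bufferedByLine implementation; the function name `normalize_newlines` is hypothetical.

```python
import io

def normalize_newlines(src, dst):
    """Copy lines from binary stream src to dst, accepting LF or CRLF
    terminators on input and always writing LF on output.

    A final line with no terminator gets one added.
    """
    for line in src:
        if line.endswith(b"\r\n"):
            line = line[:-2]   # drop the Windows terminator
        elif line.endswith(b"\n"):
            line = line[:-1]   # drop the Unix terminator
        dst.write(line + b"\n")

# Mixed input: one CRLF line, one LF line, one unterminated line.
src = io.BytesIO(b"a\tb\r\nc\td\ne\tf")
dst = io.BytesIO()
normalize_newlines(src, dst)
normalized = dst.getvalue()
```

Because the terminator is handled in one place, a shared line reader could apply this policy to every tool that uses it, while tools with their own readers would need the same logic and their own tests.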

Full customization of newline handling has a material downside: it creates additional user complexity in the form of additional command line arguments.

jondegenhardt commented 3 years ago

This has been partially addressed in PR #314. That PR takes care of most of the items described under approach 1, supporting Unix newlines. The longer term handling of Windows newlines is now being covered under issue #317.

jondegenhardt commented 3 years ago

PR #320 addresses the other significant issue in getting to approach 1: detection of Windows newlines.