eBay / tsv-utils

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
https://ebay.github.io/tsv-utils/
Boost Software License 1.0

Status of Windows build #317

Open jondegenhardt opened 3 years ago

jondegenhardt commented 3 years ago

This issue tracks the status of native Windows builds for the toolkit.

Windows is not currently a supported platform. Windows users are encouraged to run Linux builds of the tools using either Windows Subsystem for Linux (WSL) or Docker for Windows.

Acknowledging the above caveat, it is useful to track known issues on Windows and resolve them over time. At present, the toolkit is built on Windows as part of CI tests, and large parts of the test suite run successfully. Cases where the test suite does not run successfully are primarily due to limitations in the test suite, not the tools. At this time there are no known bugs in the tools, but there are gaps in the runnable test suite and limited real-world testing.

The two main obstacles to Windows support are the lack of a complete CI test suite that runs on Windows and inconsistencies in Windows newline handling. Most other known issues are minor, though a couple complicate sharing a test suite across Windows and Unix platforms. With PRs #314 and #320, newline consistency issues are largely addressed.

Windows CI Test suite status

A Windows CI test suite was set up using GitHub Actions. See PRs #313 and #315. Current status:

Notes:

Windows Newline handling

On Unix and macOS, tsv-utils requires and generates Unix newlines. However, a newline handling policy has never been defined for running on Windows. As a result, the tools are inconsistent in how they handle Windows newlines when running on Windows. Some possible newline handling policies:

  1. Read and write Unix newlines only, on all platforms.
  2. Read and write using platform-preferred newlines.
  3. Read either Windows or Unix newlines; write Unix newlines.
  4. Full customization via command line arguments.

Option 1 is the simplest policy to support and is what is being done initially. It is the easiest to enforce in the current code and the easiest to support in the current test suite. It is also a reasonable choice in many environments, especially where a mix of Unix and Windows platforms is in use. If data files are being shared, Unix newlines will normally be preferred. Option 1 is also consistent with other choices made in the toolkit, in particular supporting only one file format (UTF-8 TSV) and delegating conversion to that format to other tools (e.g. dos2unix, csv2tsv).
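To make the "Unix newlines only" policy concrete, here is a minimal Python sketch of a reader that rejects input containing Windows newlines. It only illustrates the policy; it is not how the toolkit implements it, and the function name `check_unix_newlines` and the error message are hypothetical.

```python
import io


def check_unix_newlines(stream, name="<input>"):
    """Yield lines (without the trailing LF) from a binary stream.

    Raises ValueError on the first line that ends in CRLF.
    Illustrative only: the name and message are assumptions,
    not part of tsv-utils.
    """
    for line_num, line in enumerate(stream, start=1):
        if line.endswith(b"\r\n"):
            raise ValueError(
                f"{name}, line {line_num}: Windows newline (CRLF) found; "
                "convert the file first, e.g. with dos2unix"
            )
        yield line.rstrip(b"\n")


# Usage: input with Unix newlines passes through unchanged.
lines = list(check_unix_newlines(io.BytesIO(b"a\tb\nc\td\n")))
```

The stream must be opened in binary mode, since text mode on Windows may translate newlines before the check sees them.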

Option 1 is largely in place with PRs #314 and #320, but the test suite still needs work to test it fully. Tasks:

Option 2 might be the preferred option in many traditional applications, but it is not clear if this is a good choice for data mining tools. In particular, it is very common to share data files between people, platforms, and tools. In such environments Unix newlines will be preferred. Switching to Windows newlines on Windows machines may be more an annoyance than a benefit.

Option 3, reading both newline forms but writing Unix newlines, has some nice properties. It might also be easier to do than expected, as most tools use bufferedByLine; in particular, bufferedByLine.front handles newlines. However, a number of tools have their own reader functionality, so it would still be necessary to have a test suite for each tool. And it is not really necessary given the availability of tools like dos2unix. Still, this option may be worth consideration.
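As an illustration of what option 3 means in practice, the following Python sketch reads lines terminated by either LF or CRLF and always writes LF. It is a sketch of the policy only, not of bufferedByLine or any tsv-utils code; the function names are hypothetical.

```python
import io


def read_lines_any_newline(stream):
    """Yield lines from a binary stream, accepting LF or CRLF terminators."""
    for line in stream:
        if line.endswith(b"\r\n"):
            yield line[:-2]
        elif line.endswith(b"\n"):
            yield line[:-1]
        else:
            yield line  # final line without a terminator


def copy_with_unix_newlines(src, dst):
    """Copy src to dst, terminating every output line with a single LF."""
    for line in read_lines_any_newline(src):
        dst.write(line + b"\n")


# Usage: mixed CRLF/LF input, normalized to LF on output.
src = io.BytesIO(b"a\tb\r\nc\td\ne\tf")
dst = io.BytesIO()
copy_with_unix_newlines(src, dst)
```

Only a CR immediately before the LF is treated as part of the terminator; a lone CR inside a line is passed through as data.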

Option 4, full customization of newline handling, would provide the most complete solution. However, it has material downsides. It creates additional user complexity in the form of additional command line arguments. It also creates complexity in the tools and test suite. At present these downsides seem to outweigh the benefits.

Other issues

porteusconf commented 3 years ago

Does the merge into v2.1.2 for #320 imply that, for newline handling going forward, it might be easiest to adopt option 3?

Option 3, reading both newline forms, but writing Unix newlines

In any case, while waiting for a full Windows release/build, could just csv2tsv.exe be made available, assuming it passes any needed tests? This would enable Windows users to generate valid TSV from CSV without Excel, perhaps by:

May need some documentation on how to pass escaped command-line arguments to csv2tsv.exe on Windows if using cmd or PowerShell... For example, on Linux/macOS, we can create a file with scsv (semicolon-separated values) using something like these:

csv2tsv --tsv-delim   $";"    foo.csv  > foo.scsv
csv2tsv --tsv-delim   $';'    foo.csv  > foo.scsv
csv2tsv --tsv-delim    \;     foo.csv  > foo.scsv

And I'm thinking none of the above command lines would work on Windows. Perhaps ^; would work, per which-symbol-is-escape-character-in-cmd.

But if you need to specify tab as a command line argument, then instead of cmd, Windows folks may need to use PowerShell, which can escape tab as backtick-t (`t), per the "About special characters" page in the PowerShell docs.

Finally, another workaround that avoids both cmd and PowerShell completely: install git-for-windows (choco install git) or some other bash shell for Windows. Then run csv2tsv.exe in that shell, if csv2tsv.exe can handle arguments passed to it from bash.exe.

jondegenhardt commented 3 years ago

Hi @porteusconf. Thanks for the feedback and suggestions. Some comments in-line below.

Does the merge into v2.1.2 for #320 imply that, for newline handling going forward, it might be easiest to adopt option 3?

Option 3, reading both newline forms, but writing Unix newlines

Option 1, Unix newlines only on both input and output, is by far the easiest (lowest investment cost). Option 3 is a fair bit more expensive. Much of this comes from increased test suite cost. Some comes from the fact that several tools have their own reader functionality (for example, tsv-sample).

A relevant question is how much additional benefit would be seen from investing in option 3. It's a question I don't know the answer to. How many users are there, how prevalent are data files with Windows newlines, and how onerous are the alternatives, such as invoking dos2unix on the data first?

In any case, while waiting for a full Windows release/build, could just csv2tsv.exe be made available, assuming it passes any needed tests? This would enable Windows users to generate valid TSV from CSV without Excel, ...

Well, I'm reluctant to create pre-built binary packages for only a single tool. However, I see the merit behind this idea. Perhaps there are ways to get the desired effect.

First, note that nothing prevents cloning the git repo and building the tools on Windows. The test suite is not complete for Windows, but that doesn't mean the tools won't work properly. And to your point, csv2tsv would likely pass a more complete test suite simply because the csv2tsv test suite already includes examples of files with Windows newlines.

What could be done in this regard is to: (a) Publish test suite status info for csv2tsv by itself; (b) Add any missing csv2tsv tests; (c) Add specific instructions describing how to build on Windows.

perhaps by:

  • validate foo.csv with some tool(s), for example csvlint (https://csvlint.io/). But be warned: if you download the "standardized" csv they offer, I think it silently adds double-quotes around every field, including numbers. For example, if foo.csv has a row foo,22 it becomes "foo","22" in the "standardized" csv file (not sure why).
  • csv2tsv.exe foo.csv > foo.tsv Note that csv2tsv by default removes double-quotes where not needed, so foo.tsv would be fooTAB22

csv2tsv doesn't have any trouble reading any of these formats, but as you point out, it always generates escape-free TSV.
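For readers unfamiliar with what "escape-free TSV" means here, the following Python sketch mimics the idea using the standard csv module: quotes are removed during parsing, and any TAB or newline inside a field is replaced so the output needs no escaping. It is not csv2tsv's implementation; the function name is hypothetical, and the single-space replacement is an assumption made for this sketch.

```python
import csv
import io
import re


def csv_to_tsv(csv_text, replacement=" "):
    """Convert CSV text to escape-free TSV text with Unix newlines.

    TABs and newlines inside fields are replaced with `replacement`
    (a single space here, an assumption for this sketch).
    """
    # newline="" keeps CRLF/LF untouched so the csv module can parse them.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = []
    for row in reader:
        cleaned = [re.sub(r"\r\n|\n|\r|\t", replacement, field) for field in row]
        out.append("\t".join(cleaned) + "\n")
    return "".join(out)


# Usage: quoted and unquoted rows, with Windows newlines, give the same TSV.
tsv = csv_to_tsv('foo,22\r\n"foo","22"\r\n')
```

Both input rows produce the same output line, foo TAB 22, which matches the behavior described above for the "standardized" csvlint output.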

May need some documentation on how to pass escaped command-line arguments to csv2tsv.exe in windows if using cmd or powershell...

Good thoughts, thank you.

Finally, another workaround that avoids both cmd and PowerShell completely: install git-for-windows (choco install git) or some other bash shell for Windows. Then run csv2tsv.exe in that shell, if csv2tsv.exe can handle arguments passed to it from bash.exe.

Agreed, it might make sense to include this option in the documentation.

Imperatorn commented 3 years ago

Status?

jondegenhardt commented 3 years ago

Status?

Status as described in the main description is up-to-date. It is updated as things change. At present, there are no known failure cases on Windows. But, since the test suite doesn't run fully, it leaves unknowns. Also, there's a lack of real-world use on Windows, or at least use that gets reported. So it is more about unknowns at this point.

Do you have specific questions?

Imperatorn commented 3 years ago

No, I was just wondering why there weren't any Windows binaries. I've put them here for anyone interested: https://github.com/Imperatorn/tsv-utils/releases

jondegenhardt commented 3 years ago

@Imperatorn Great to know there's someone trying the tools on Windows! Report any problems you have, Windows related or otherwise!