Open jondegenhardt opened 4 years ago
Does the merge into v2.1.2 for #320 imply that, for newline handling going forward, it might be easiest to adopt option 3?
Option 3, reading both newline forms, but writing Unix newlines
In any case, while waiting for a full windows release/build, could just csv2tsv.exe be made availalble, assuming it passes any needed tests? This would enable Windows users to generate valid tsv from csv without excel, perhaps by:
foo,22
it becomes "foo","22"
in the "standardized" csv file (not sure why).csv2tsv.exe foo.csv > foo.tsv
Note that csv2tsv by default removes double-quotes where not needed, so foo.tsv would be fooTAB22
May need some documentation on how to pass escaped command-line arguments to csv2tsv.exe in windows if using cmd or powershell... For example, on linux/macos, we can create a file with scsv (semi-colon separated values) using something like these:
csv2tsv --tsv-delim $";" foo.csv > foo.scsv
csv2tsv --tsv-delim $';' foo.csv > foo.scsv
csv2tsv --tsv-delim \; foo.csv > foo.scsv
And I'm thinking none of the above command-lines would work on windows. Perhaps ^;
would work per which-symbol-is-escape-character-in-cmd
But if you need to specify tab as a command line argument, then instead of cmd windows folks may need to use powershell, which can escape tab as backtick-t
`t
per About special characters in PowerShell docs
Finally, another work-around avoiding both cmd and powershell completely, just install git-for-windows (choco install git
or some other bash shell for windows). Then run csv2tsv.exe in that shell, if csv2tsv.exe can handle arguments passed to it from bash.exe.
Hi @porteusconf. Thanks for the feedback and suggestions. Some comments in-line below.
Does the merge into v2.1.2 for #320 imply that, for newline handling going forward, it might be easiest to adopt option 3?
Option 3, reading both newline forms, but writing Unix newlines
Option 1, Unix newlines only on both input and output is by far the easiest (lowest investment cost). Option 3 is a fair bit more expensive. Much of this comes from increased test suite cost. Some because there are a several tools that have their own reader functionality (for example, tsv-sample
).
A relevant question is how much additional benefit would be seen investing in option 3? It's a question I don't know the answer to. How many users, how prevalent are the data files, and how onerous are the alternatives, such as invoking dos2unix
on the data first.
In any case, while waiting for a full windows release/build, could just csv2tsv.exe be made availalble, assuming it passes any needed tests? This would enable Windows users to generate valid tsv from csv without excel, ...
Well, I'm reluctant to create pre-built binary packages for only a single tool. However, I see the merit behind this idea, perhaps there are ways to get the desired effect.
First, note that nothing prevents cloning the git repo and building the tools on Windows. The test suite is not complete for Windows, but that doesn't mean the tools won't work properly. And to your point, csv2tsv
would likely passes a more complete test suite simply because the csv2tsv
test suite already includes examples of files with Windows newlines.
What could be done in this regard is to: (a) Publish test suite status info for csv2tsv
by itself; (b) Add any missing csv2tsv
tests; (c) Add specific instructions describing how to build on Windows.
perhaps by:
- validate foo.csv with some tool(s), for example https://csvlint.io/ But be warned, if you download the "standardized" csv they offer, I think it silently adds double-quotes around every field, including numbers. For example if foo.csv has a row
foo,22
it becomes"foo","22"
in the "standardized" csv file (not sure why).csv2tsv.exe foo.csv > foo.tsv
Note that csv2tsv by default removes double-quotes where not needed, so foo.tsv would befooTAB22
csv2tsv
doesn't have any trouble reading any of these formats, but as you point out, it always generates escape-free TSV.
May need some documentation on how to pass escaped command-line arguments to csv2tsv.exe in windows if using cmd or powershell...
Good thoughts, thank you.
Finally, another work-around avoiding both cmd and powershell completely, just install git-for-windows (
choco install git
or some other bash shell for windows). Then run csv2tsv.exe in that shell, if csv2tsv.exe can handle arguments passed to it from bash.exe.
Agreed, it might make sense to include this option in the documentation.
Status?
Status?
Status as described in the main description is up-to-date. It is updated as things change. At present, there are no known failure cases on Windows. But, since the test suite doesn't run fully, it leaves unknowns. Also, there's a lack of real-world use on Windows, or at least use that gets reported. So it is more about unknowns at this point.
Do you have specific questions?
No, I was just wondering why there weren't any Windows binaries. I've put them here for anyone interested: https://github.com/Imperatorn/tsv-utils/releases
@Imperatorn Great to know there's someone trying the tools on Windows! Report any problems you have, Windows related or otherwise!
This issues tracks the status of native Windows builds for the toolkit.
Windows is not currently a supported platform. Windows users are encouraged to run Linux builds of the tools using either Windows Subsystem for Linux (WSL) or Docker for Windows.
Acknowledging the above caveat, it is useful to track known issues on Windows and resolve them over time. At present, the toolkit is built on Windows as part of CI tests, and large parts of the test suite run successfully. Cases where the test suite does not run successfully are primarily due to limitations in the test suite, not the tools. At this time there are no known bugs in the tools, but there are gaps in the runnable test suite and limited real-world testing.
The two main issues for having Windows support are having a complete CI test suite that runs on Windows, and resolving inconsistencies in Windows newline handling. Most other known issues are minor, though a couple complicate having a test suite shared across Windows and Unix platforms. With PRs #314 and #320 newline consistency issues are largely addressed.
Windows CI Test suite status
A Windows CI test suite was setup using GitHub Actions. See PRs #313 and #315. Current status:
dub
, DMD 64-bit - Done (passses)dub
, LDC 64-bit - Done (passses)make
, DMD 64-bit - Done (passses)make
, LDC 64-bit - Done (passses)make unittest
, DMD 64-bit - Passes, with minor caveats. See "Disabled unit tests" below.make unittest
, LDC 64-bit - Fails intsv-sample
due to differences in floating point format output. As an example, a floating point value formatted as0.75346697377181959
on Linux/MacOs is formatted as0.75346697377181970
on Windows. This looks like a round-off difference in a mathematical calculation, nothing more.make test
, both DMD and LDC. These fail from the beginning because of newline differences between the golden set files from the repo and the results produced in the tests. In particular, the "error" and "help" tests that generate messages to end users are getting written using Windows newlines, but the "gold" files with expected results were generated on Unix. The main complication is that many expected results should be newline equivalent, so different strategies are needed for each.Notes:
common.tsv-utils.numerics.formatNumber
function fail on Windows and were disabled. The cases areformatNumber(0.6, 0)
andformatNumber(-0.6, 0)
. These should return"1"
and"-1"
respectively, but on Windows return"0"
and"-0"
. This is due to incorrect results from the calls being made tostd.format.format
.std.format.format
is likely callingsnprintf
from platform's C library. The disabled unit tests are in common.tsv-utils.numerics.d starting at lines 220 and 334.Update (03/14/21): This has been addressed in DMD 2.096, PR #7757 by moving the work into Phobos. Unit tests have been conditioned on version.
Windows Newline handling
On Unix and MacOs
tsv-utils
requires and generates Unix newlines. However, a newline handling policy has never been identified for running on a Windows platform. As a result, tools are inconsistent in the manner they handled Windows newlines when running on Windows. Some possible newline handling policies:Option 1 is simplest policy to support and what is being done initially. It is the easiest to enforce in the current code, and easiest to support in the current test suite. And, it is a reasonable choice in many environments, especially in circumstances where a mix of Unix and Windows platforms are in use. If data files are being shared, Unix newlines will normally be preferred. Option 1 is also consistent with other choices made in the toolkit. In particular, supporting only one file format (UTF-8 TSV), and delegating conversion to that format to other tools (e.g.
dos2unix
,csv2tsv
).Option 1 is largely in place with PRs #314 and #320, but the test suite still needs work to test it fully. Tasks:
Option 2 might be the preferred option in many traditional applications, but it is not clear if this is a good choice for data mining tools. In particular, it is very common to share data files between people, platforms, and tools. In such environments Unix newlines will be preferred. Switching to Windows newlines on Windows machines may be more an annoyance than a benefit.
Option 3, reading both newline forms, but writing Unix newlines, has some nice properties. And, it might be easier done than expected, as most tools use bufferedByLine. In particular, bufferedByLine.front handles newlines. However, a number of tools have their own reader functionality, so it would still be necessary to have a test suite for each tool. And, it is not really necessary given the availability of tools like
dos2unix
. Still, this option may be worth consideration.Option 4, full customization of newline handling, would provide the most complete solution. However, it has a material downsides. It creates additional user complexity in the form of additional command line arguments. It also creates complexity in the tools and test suite. At present these downsides seem to outweigh the benefits.
Other issues
tsv-pretty
when printing floats in formatted forms. Appears to affect a relatively small number of forms, though it is undesirable when it occurs.Update (03/14/21): This has been addressed in DMD 2.096, PR #7757 by moving the work into Phobos. This should eliminate this problem, especially when the LDC release with this version is available.