SciNim / Datamancer

A dataframe library with a dplyr-like API
https://scinim.github.io/Datamancer/datamancer.html
MIT License

readCsv does not consider quotes for values as expected #58

Closed · smurgle closed this issue 8 months ago

smurgle commented 12 months ago

Hi, first of all thank you for your wonderful work on Datamancer, it's a really nice library. Here is a glitch (at least for me) I've found.
I'm testing on Datamancer 0.3.17 / Windows 10. If a CSV field is surrounded by double quotes, I would expect that:

1) The double quotes are "consumed", i.e. the extracted string no longer contains them.
2) They allow ignoring any characters inside them that would otherwise act as separators, line breaks, etc., as described in the documentation.

This is what happens with std/parsecsv, but apparently not with Datamancer's readCsv for values (while it works for column names). I've enclosed a very small snippet to show the problem for point 1. One obvious drawback (maybe it's wanted behaviour, but it would be nice to give the user the option to choose parsecsv-like behaviour) is that digit-only columns (like "Two" and "Four" in the example) cannot be converted to float or int without applying a string modification first. While CSVs like this may appear exotic, they are not: if someone exports a DB table to a CSV file and is not sure about the end user's regional settings or about multi-line fields (e.g. "Note"), enclosing every field in double quotes is a robust approach.

import std/[parsecsv,strutils], datamancer
var p: CsvParser
let content = """"One","Two","Three","Four"
"a1","2","a3","4"
"b10","20","b30","40"
"c100","200","c300","400"
"""
writeFile("temp.csv", content)
p.open("temp.csv")
#p.readHeaderRow() # left commented out so the header row is echoed like a normal row
while p.readRow():
  echo(join(p.row,","))
p.close()

#[double quotes are consumed when using std/parsecsv; echo displays this:

One,Two,Three,Four
a1,2,a3,4
b10,20,b30,40
c100,200,c300,400

]#

let csvfile = "temp.csv"
let df = readCsv(csvfile, quote = '"') #no better result with '\"'
echo(df.pretty())

#[double quotes are not consumed as expected; df.pretty() shows this for all values, not only for digit-only strings

DataFrame with 4 columns and 3 rows:
     Idx          One          Two        Three         Four
  dtype:       string       string       string       string
       0         "a1"          "2"         "a3"          "4"
       1        "b10"         "20"        "b30"         "40"
       2       "c100"        "200"       "c300"        "400"
]#
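
As a hedged, untested sketch of the "string modification" workaround mentioned above (assuming Datamancer's mutate and the f{} formula macro behave as documented; dfFixed is a hypothetical name, df is the dataframe loaded by readCsv above):

import std/strutils, datamancer

# Strip the surviving quotes from the digit-only column "Two" and parse
# it to int so the column gets a proper integer dtype.
let dfFixed = df.mutate(f{string -> int: "Two" ~ parseInt(strip(`Two`, chars = {'"'}))})
echo(dfFixed.pretty())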

# FURTHERMORE: if a quoted field contains a newline (\n), parsing will fail
Vindaar commented 10 months ago

Sorry for ignoring your issue for so long!

Yeah, that is a fair point. The current CSV parser still has a bunch of rough edges. One can always fall back to

let df = toDf(readCsvAlt("foo.csv"))

which uses the old CSV parser based on std/parsecsv. I haven't tested it now, but it should behave as you expect.

I'll fix it though. Quoting is supported in the fast CSV parser, but consuming the quotes is apparently not working correctly.

edit: The "numbers as names" should be quite doable, at least as an option; I'll look into it.

Vindaar commented 10 months ago

About to push a first fix for these things to this branch:

https://github.com/SciNim/Datamancer/tree/moreMiscThings

It will probably be another week or two until I fully PR and merge it. But it would be great if you could check out that branch and see if it works as expected. At least your example is working fine for me (including a 5th column of string digits).

smurgle commented 10 months ago

Hi Vindaar, don't worry, I didn't expect a fast reply. Thanks anyway for the feedback and for the suggested workaround, readCsvAlt(), which I was not aware of. I tested the new branch (which forced me to "learn" how; I'm still really green... it turned out to be as simple as

nimble install datamancer@#6975a06

). Indeed, the new branch seems to work: it consumes double quotes now. Nice work! Still, it does not handle another edge case as well as readCsvAlt() does, i.e. a line feed (\n) within a field. This case might be a bit less frequent, but it can happen.
I've read Araq warn people a few times about "simplified" CSV parsing. So I would suggest not dropping the readCsvAlt() option... albeit maybe slower, it proves to be still valuable for handling files like this. Here is a small snippet:


import datamancer

# If a quoted field contains a newline (\n), parsing will still fail.
# mytemp2.csv is such a nasty CSV file; it is written out here so the
# snippet is self-contained:
let content2 = """"One","Two","Three","Four"
a1,2,"line1
line2",4
b10,20,"lineA
lineB",40
c100,200,"c300",400
"""
writeFile("mytemp2.csv", content2)

let csvfile2 = "mytemp2.csv"
let df3 = toDf(readCsvAlt(csvfile2)) # it does work
echo(df3.pretty())

#[

DataFrame with 4 columns and 3 rows:
     Idx          One          Two        Three         Four
  dtype:       string          int       string          int
       0           a1            2  line1
line2            4
       1          b10           20  lineA
lineB           40
       2         c100          200         c300          400

]#
let df4 = readCsv(csvfile2, quote = '"') #it crashes
echo(df4.pretty())

This is the output:


c:\Users\Andrea\Documents\Nim\testing_code\nimdata\parsetest5.nim(37) parsetest5
C:\Users\Andrea\.nimble\pkgs2\datamancer-0.3.17-6734c99420fb8a160d7a7cebcb14ed3a0cdf61f5\datamancer\io.nim(715) readCsv
C:\Users\Andrea\.nimble\pkgs2\datamancer-0.3.17-6734c99420fb8a160d7a7cebcb14ed3a0cdf61f5\datamancer\io.nim(561) readCsvTypedImpl
Error: unhandled exception: Line counts mismatch. 4 lines read, expected 5. Is your file using non unix line breaks? Try switching the `lineBreak` and `eat` options to `readCsv`. [IOError]
Error: execution of an external program failed: 'c:\Users\Andrea\Documents\Nim\testing_code\nimdata\parsetest5.exe'
Vindaar commented 10 months ago

I pushed a fix to allow parsing files with line breaks in quoted fields. To activate that, pass the allowLineBreaks = true argument to the readCsv procedure.
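
A minimal sketch of the fix in use, reusing the nasty file from the previous comment (untested here; df5 is a hypothetical name, while readCsv, quote and allowLineBreaks are as described above):

import datamancer

# With allowLineBreaks = true, quoted fields may span multiple lines.
let df5 = readCsv("mytemp2.csv", quote = '"', allowLineBreaks = true)
echo(df5.pretty())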

> I've read Araq warn people a few times about "simplified" CSV parsing.

It's something @c-blake and I have discussed at length (in private) in the past. CSV is pretty horrible, as there isn't even a standard to follow. "Simplified" compared to what? In my personal opinion, the moment one starts to use CSVs for more than simple int/float/bool/short basic string data, it is very ill advised. I'm aware certain fields use CSVs for all sorts of things, but well. Attempting to parse every possible bizarre version of what someone calls a CSV file is indeed a tricky task. If I had to do this for my work, I would probably perform an intermediate parse with an external, very robust CSV parser and save the data into a more suitable format before continuing.

But all that's beside the point. I'll continue to try to improve what we have here, at least as long as the additions / improvements remain simple.
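
For illustration, here is a minimal Nim sketch of that "convert first" idea, using the robust std/parsecsv to rewrite an awkward CSV as a plain, newline-splittable TSV (csvToTsv and both file arguments are hypothetical; the escaping scheme is one arbitrary choice):

import std/[parsecsv, strutils]

proc csvToTsv(inFile, outFile: string) =
  ## Parse `inFile` with std/parsecsv (which handles quoting and embedded
  ## newlines) and write a plain TSV where every record is one line.
  var p: CsvParser
  p.open(inFile)
  let o = open(outFile, fmWrite)
  while p.readRow():
    var fields: seq[string]
    for f in p.row:
      # Escape embedded tabs/newlines so the output stays newline-splittable.
      fields.add f.multiReplace(("\t", "\\t"), ("\n", "\\n"))
    o.writeLine fields.join("\t")
  o.close()
  p.close()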

c-blake commented 10 months ago

@smurgle - You might find c2tsv.nim, c2tsvs.nim or an ANSI C variant c2dsv.c useful. If you are looking for a pithy slogan, "Don't just parse - convert for fast loading" might be a good one. If you don't have disk space for two copies, the pipeline conversion idea from Unix mentioned in those c2* docs is also available on Windows via popen (see the template popent in cligen/osUt.nim), as are RAM filesystems to avoid IO (when you probably do have the space).

One example benefit of converting to a soundly newline-splittable format is that, once the conversion is done, you can (often) parallelize file processing over big chunks with something like cligen/mslice.nPart. It's usually even more efficient to convert all the way to binary, like nio or HDF5, where of course you retain random access to records. (Most efficient of all is to not generate ASCII text in the first place but simply run off of live binary files, but we don't always control our sources of data/interchange or have file space for more copies. Another example where this bites is if the file is compressed with zip or gzip and so is not random access for nSplit work.)

smurgle commented 10 months ago

@Vindaar, let me clarify that I didn't mean to bother you; I'm sorry if my words ("simplified") somehow offended you. I'm aware that a readCsv relying on Nim's std/parsecsv off the shelf would have been much less work for you, and that performance vs. completeness often requires a trade-off. Does it make sense to support a case that happens maybe 1% of the time at the expense of being, say, 20% slower for the remaining 99%? The answer is context-dependent. IMHO it makes sense at least to keep the slower solution as an option.

I agree that CSV is not the best format, although convenient solutions often beat optimal solutions. In my limited experience CSV (or even better, zipped CSV) is pretty good compared to, say, xlsx (compressed XML). Pandas/Python has a relatively fast read_csv implementation (even when reading single zipped/gzipped files), while read_excel is painfully slow. Pandas' read_csv is one of the most important "selling points" of the entire library; it ships with a myriad of options (and yes, also alternative parsing engines: 'C', 'python', 'pyarrow'), and it does not break even on an inconvenient CSV like my example (a newline within a field enclosed in double quotes). So I was just suggesting not to drop support for "edge cases" like the one I mentioned.

By the way, even if CSV was born before any standardization, IETF RFC 4180 started a sort of standardization attempt, and it states: "(...) Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes". I know that CRLFs are often a major source of hassle... there are the CRLFs I would happily avoid (like in my nasty CSV example), as well as the CRLFs I hope stay there forever, like the ones stripped out to (mildly) reduce a huge XML file's size, which instantly killed my "non-standard XML" awk-parser utility, forcing me to look for alternatives and... landing on a new (for me) beautiful, fast, elegant, feature-rich, powerful programming language backed by a small but vibrant community of many talented developers. That language is Nim. Sometimes adversities lead to unexpected rewards!!! ;)

c-blake commented 10 months ago

> RFC 4180 has started a sort of standardization attempt

Directly quoting the top of RFC 4180:

> Status of This Memo
>
> This memo provides information for the Internet community. It does not specify an Internet standard of any kind.

(emphasis mine) Maybe believe the document itself instead of Internet chatter on the topic?

But anyway, there are c2tsv.nim and c2dsv.c, which both handle \r\n or \n for a superset of 4180 formats and are also both likely faster than the Pandas read_csv (when compiled with full optimizations). Just use those as a first-stage filter, either in real time or up front & done. Then commas & newlines will actually S)eparate your allowed & realized fields & records (more per what people call "TSV", instead of being the oxymoron that quoted-escaped C-"S"-V from Excel is), and you can just use vanilla awk (or whatever simple consumer, like probably Datamancer) on the data directly.

Conversion may sound slower than parsing, but you probably have a spare 1/2 CPU core across a pipeline, so it is probably faster, depending upon pipe bandwidth. I measured c2tsv.nim at ~7X faster than Nim's std/parsecsv (which Araq's comments would probably point you at), for example. At least on Linux, pipe bandwidth is usually much higher than splitting, even with string-creation elision. You can chain conversions together, such as popen("gunzip -dc < foo.csv.gz | c2tsv"), aka catz foo.csv.gz | c2tsv | awk -F "$tab" ... (with a prior IFS='' tab="$(printf \\t)" to establish "$tab" for you on Unix, which I am assuming because you mentioned awk). For that pipeline, gunzip will be your bottleneck because it is very slow. { There are some bioinformatics guys with a parallel version called bgzip, but data must be compressed in independent blocks for that to decompress any faster - you may want to read this whole thread. }
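
A hedged sketch of consuming such a pipeline directly from Nim, with no temporary file (the command string is taken verbatim from the comment above; Unix shell assumed, error handling omitted):

import std/[osproc, streams, strutils]

# poEvalCommand runs the string through the shell, so the pipe works.
let p = startProcess("gunzip -dc < foo.csv.gz | c2tsv",
                     options = {poEvalCommand})
for line in p.outputStream.lines:
  echo line.split('\t')  # records now split soundly on \n, fields on \t
p.close()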

smurgle commented 10 months ago

@c-blake, "a sort of" and "attempt" words were meant to make the difference. Yet many IETF RFCs are influential even before (or without) becoming a standard. While this memo has no explicit ambition to become a standard, it request to register "csv/txt" as Mime type and provides a guidance: "This section documents the format that seems to be followed by most implementations". Thank your for your suggestions about csv/tsv converters, I will look to them to see if they can fit my needs. I'm on Windows (mostly) but I managed to install a few gnu/linux utilities with Gow, and I also installed mawk (a faster-than-gawk implementation where scope have been narrowed a little - e.g. avoid large integers operations on that - in favour of speed).

Vindaar commented 10 months ago

> @Vindaar, let me clarify that I didn't mean to bother you; I'm sorry if my words ("simplified") somehow offended you

Just came to my computer to check something else and saw your message. I don't have time to read everything, but let me be very clear that at no point have I felt offended. Sorry if it came across that way. :heart: