T-F-S / csvsimple

A LaTeX package for lightweight CSV file processing.
http://www.ctan.org/pkg/csvsimple
LaTeX Project Public License v1.3c

CSVSimple ignoring line when enclosed field in double quotes has comma in it #19

Closed: Girgias closed this issue 1 year ago

Girgias commented 2 years ago

This was initially posted on TeX Stack Exchange: https://tex.stackexchange.com/questions/630737/csvsimple-ignoring-line-when-enclosed-field-in-double-quotes-has-comma-in-it

But it might be a bug in the library.

The library doesn't follow RFC 4180 in how it handles enclosed fields (fields which start with a double quote, "): a comma inside such a field should not be interpreted as a delimiter between fields.

Currently it is, so the line is dropped because more columns are found than expected. I worked around this with the no check column count option, but that seems rather suboptimal.
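For illustration, the workaround mentioned above could look like the following minimal sketch. The file name quoted.csv and its three columns are made up for this example; only the no check column count option comes from the report.

```latex
\documentclass{article}
\usepackage{csvsimple-l3}

% Hypothetical test file: line 1 contains a quoted field with a comma.
\begin{filecontents*}[overwrite]{quoted.csv}
Rank,Name,Score
0,Evolved ANN,2.49
1,"DBS: 0.75, 3",2.43
\end{filecontents*}

\begin{document}
% `no check column count' keeps lines whose number of fields differs
% from the header line, so line 1 is no longer dropped.
% The quoted field is still split at its inner comma, though.
\csvreader[no check column count]{quoted.csv}{}{%
  \csvcoli: \csvcolii\par
}
\end{document}
```

The line is kept, but the quotes are not treated as grouping, which is why this only hides the problem.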

I'll try to have a look at the source code to see how the parsing is done. But until then I'll open an issue to be able to track this.

T-F-S commented 2 years ago

This is not a bug, but a documented restriction (page 3):

[Screenshot of the relevant passage of the manual]

If you do not want to change grouping by hand, page 56 describes data transformation using the CSV-Sorter program.
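Changing the grouping by hand means wrapping a field that contains a comma in curly braces instead of double quotes. A minimal sketch, assuming a made-up file grouped.csv with three columns:

```latex
\documentclass{article}
\usepackage{csvsimple-l3}

% Hypothetical test file: the field with a comma is grouped with braces.
\begin{filecontents*}[overwrite]{grouped.csv}
Rank,Name,Score
0,Evolved ANN,2.49
1,{DBS: 0.75, 3},2.43
\end{filecontents*}

\begin{document}
% The braces hide the inner comma from the field splitter,
% so every line has the same number of columns as the header.
\csvreader[head to column names]{grouped.csv}{}{%
  \Rank: \Name\par
}
\end{document}
```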

lvjr commented 2 years ago

(Copied from my answer to the above question on TeX.SE)

You can patch the internal \__csvsim_read_line: command of csvsimple-l3, using l3regex to replace every "..." with {...}, as long as the quoted field contains no "" or newlines.

\documentclass{article}

\usepackage{csvsimple-l3}
\usepackage{longtable}
\usepackage{etoolbox}

\begin{filecontents*}[overwrite]{\jobname.csv}
Rank,Name,Median_score,Cooperation_rating,Wins,Initial_C_rate,CC_rate,CD_rate,DC_rate,DD_rate,CC_to_C_rate,CD_to_C_rate,DC_to_C_rate,DD_to_C_rate
0,Evolved ANN 5 Noise 05,2.4905443548387094,0.4924381048387097,126.5,0.9506048387096774,0.3979901209677421,0.09444798387096777,0.19763709677419344,0.3099247983870967,0.8131456249267315,0.6305697857143637,0.43896754433182866,0.16263751323498196
1,"DBS: 0.75, 3, 4, 3, 5",2.431239919354839,0.3988991935483871,135.0,0.9509274193548387,0.3164965725806451,0.08240262096774195,0.22036794354838693,0.38073286290322605,0.8341798279957744,0.49733465333274995,0.27789064893581683,0.22881418670824838
2,Second by RichardHufford,2.3971673387096772,0.40838165322580644,132.0,0.9488709677419355,0.3064282258064516,0.10195342741935484,0.22199536290322577,0.36962298387096776,0.8199664917952298,0.15203283502430787,0.6611790544745667,0.07946789603790326
3,Revised Downing,2.3799798387096773,0.39448004032258066,119.0,0.9506048387096774,0.2698252016129031,0.1246548387096774,0.24078346774193546,0.364736491935484,0.7081181279100983,0.419492212806942,0.38152788294217344,0.25469205691448726
4,Evolved FSM 16 Noise 05,2.37633064516129,0.44351814516129034,111.0,0.9502419354838709,0.3406358870967743,0.1028822580645161,0.1989620967741934,0.3575197580645161,0.8603392240480966,0.48914238424923623,0.4119413240129918,0.09541845008601585
5,Evolved ANN,2.372368951612903,0.41917661290322583,136.0,0.9515322580645161,0.3513897177419357,0.0677868951612903,0.1842770161290322,0.3965463709677418,0.8133328916697765,0.6177725034894199,0.29406405406264435,0.07026965468508092
6,Second by Borufsen,2.316108870967742,0.5484534274193549,116.0,0.9478225806451613,0.42302157258064493,0.1254318548387097,0.1488096774193549,0.3027368951612904,0.9150910248577407,0.13194067320351477,0.6821776531424888,0.144335054814451
\end{filecontents*}

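\ExplSyntaxOn
% Append code to csvsimple's internal line reader (etoolbox's \appto).
\appto\__csvsim_read_line:{
  % Work on a local copy of the line that was just read.
  \tl_set_eq:NN \l_tmpa_tl \csvline
  % Turn every "..." into {...} so that inner commas are protected.
  \regex_replace_all:nnN { "([^"]+)" } { {\1} } \l_tmpa_tl
  % Write the result back to \csvline globally.
  \tl_gset_eq:NN \csvline \l_tmpa_tl
}{}{}
\ExplSyntaxOff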

\begin{document}

\csvreader[
  respect all,
  longtable = |r|l|,
  head to column names,
]{\jobname.csv}{}{
  \Rank & \Name
}

\end{document}

[Screenshot of the compiled output]

T-F-S commented 2 years ago

@lvjr That is a very nice example of applying the regular expression facilities of LaTeX3.

For csvsimple, I would not add this to the implementation, because of the added computation time and problems with LaTeX CSV files like:

A,"Ubel "ubel sprach der D"ubel und verschwand in der Wand
B,B"aren m"ogen es s"u"s

But for the OP this could be very useful as a patch. Currently, I do not have much time, but I can imagine adding a hook or something similar to allow \csvline manipulations more safely than by patching.

lvjr commented 2 years ago

The LaTeX kernel provides several hook commands (\NewHook, \AddToHook, \UseHook) for safe patching. Maybe csvsimple could use them.
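As a generic sketch of how these three commands fit together (requires a LaTeX kernel from 2020-10-01 or later): the hook name mypkg/line and the macros \processline, \currentline and \savedline are hypothetical and are not part of csvsimple.

```latex
\documentclass{article}

% Package side: declare a hook (hypothetical name, for illustration only).
\NewHook{mypkg/line}

% Package side: store the data, run the hook, then use the data.
\newcommand\processline[1]{%
  \def\currentline{#1}%
  \UseHook{mypkg/line}% user code may modify \currentline here
  \currentline\par
}

% User side: add code to the hook instead of patching \processline.
\AddToHook{mypkg/line}{%
  \let\savedline\currentline
  \def\currentline{[\savedline]}% wrap the line in brackets
}

\begin{document}
\processline{some text}
\end{document}
```

The advantage over patching is that the package defines the point at which user code runs, so internal macros can change without breaking it.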

T-F-S commented 2 years ago

Yes, I would go for adding such a hook.

T-F-S commented 1 year ago

Hook csvsimple/csvline added with example for double-quote replacement: https://github.com/T-F-S/csvsimple/releases/tag/v2.4.0
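A sketch of how the double-quote replacement from this thread might be moved into the new hook. It assumes csvsimple-l3 v2.4.0 or later, that the hook code runs after each line has been stored in \csvline, and that \csvline may be overwritten there. The example shipped with the release may differ, and the file quoted.csv is made up.

```latex
\documentclass{article}
\usepackage{csvsimple-l3}

% Hypothetical test file: line 1 contains a quoted field with a comma.
\begin{filecontents*}[overwrite]{quoted.csv}
Rank,Name,Score
0,Evolved ANN,2.49
1,"DBS: 0.75, 3",2.43
\end{filecontents*}

\ExplSyntaxOn
% Same replacement as in the patch earlier in this thread,
% but attached to the hook instead of an internal macro.
\AddToHook { csvsimple/csvline }
  {
    \tl_set_eq:NN \l_tmpa_tl \csvline
    % Turn every "..." into {...}; empty "" fields are not matched.
    \regex_replace_all:nnN { "([^"]+)" } { {\1} } \l_tmpa_tl
    \tl_gset_eq:NN \csvline \l_tmpa_tl
  }
\ExplSyntaxOff

\begin{document}
\csvreader[head to column names]{quoted.csv}{}{%
  \Rank: \Name\par
}
\end{document}
```

As noted earlier in the thread, this replacement is unsuitable for files that use " as a babel shorthand.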