NetLogo / CSV-Extension

A NetLogo extension for reading CSV

Importing a line containing `\"` throws RuntimeException #8

Open qiemem opened 9 years ago

qiemem commented 9 years ago

Reported by @dougedmunds in NetLogo/NetLogo#845. That bug has a more extensive example, but the following is sufficient to reproduce:

csv:from-row "\"\\\" \\\"\""

That is, attempting to parse the string "\" \"" results in:

java.lang.RuntimeException: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter

dougedmunds commented 9 years ago

The same error also occurs with csv:from-string:

 csv:from-string "\"\\\" \\\"\""

qiemem commented 9 years ago

The problem here is that the CSV parser is seeing the quote in \" as the closing quote for the cell. So in the entry:

"plot count scouts with [task-string = \"watching-dance\"]"

the parser matches the first \" with the opening quote, so that it thinks "plot count scouts with [task-string = \" is the cell. But then it sees a bunch of other stuff before it sees a delimiter, so it chokes.
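As a rough illustration of that misreading outside NetLogo, here is a minimal Python sketch. It uses Python's standard csv module in strict mode as a stand-in parser, which is an assumption for illustration only and not the parser this extension uses. The names `apparent_cell` and `parses_cleanly` are hypothetical.

```python
import csv

line = r'"plot count scouts with [task-string = \"watching-dance\"]"'

# A parser that does not treat backslash as special sees the first quote
# after the opening one as the end of the cell.
apparent_cell = line[: line.index('"', 1) + 1]
print(apparent_cell)


def parses_cleanly(text):
    """Report whether a strict CSV parser with no escape character accepts the line."""
    try:
        next(csv.reader([text], strict=True))
        return True
    except csv.Error:
        return False


# The text after the apparent closing quote is not a delimiter, so the line is rejected.
print(parses_cleanly(line))
```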

One of the tough parts about CSV is that there isn't really any agreed-upon standard. Besides people using all sorts of different delimiters, cell quotation practices, and so forth, how special characters are escaped varies as well. Unlike in many applications, escaping things like newlines and tab characters is not necessary, as you can just stick them in a quoted field. Quotation marks do need to be escaped, however. As I understand it, most software that uses CSV escapes quote marks by putting two quote marks in a row (Excel is a notable example). So the above line would be written:

"plot count scouts with [task-string = ""watching-dance""]"

So here's the dilemma. I could specify \ as an escape character, but then that messes up files coming from programs that use it as a regular character (such as Excel). Alternatively, I could add another optional argument to the procedures in this extension to specify the escape character, but that complicates the API significantly. I'm not sure what the best solution is.
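The trade-off can be sketched with Python's standard csv module, which happens to expose both behaviours. This is an illustrative stand-in under that assumption, not this extension's API. The `excel_style` row is a hypothetical example of data in which backslashes are ordinary characters.

```python
import csv


def parse_default(line):
    """Backslash is an ordinary character; quotes are escaped by doubling."""
    return next(csv.reader([line]))


def parse_with_backslash_escape(line):
    """Backslash removes the special meaning of the character that follows it."""
    return next(csv.reader([line], escapechar="\\"))


doubled_style = '"plot count scouts with [task-string = ""watching-dance""]"'
backslash_style = r'"plot count scouts with [task-string = \"watching-dance\"]"'
excel_style = r'"C:\data\",42'  # hypothetical row with literal backslashes

print(parse_default(doubled_style))
print(parse_with_backslash_escape(backslash_style))
print(parse_default(excel_style))
# With backslash as an escape character, the same row no longer splits into two cells.
print(parse_with_backslash_escape(excel_style))
```

Each setting reads one style correctly and misreads the other, which is why a single fixed choice cannot serve both kinds of files.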

As a workaround, you can replace instances of \" in your strings with "". The string extension's (https://github.com/NetLogo/String-Extension/) rex-replace-all procedure would be particularly useful for this. Let me know if you'd like help with this.
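The same substitution can be sketched in Python, again using the standard csv module as a stand-in parser (an assumption for illustration). The function name `parse_backslash_quoted_row` is hypothetical.

```python
import csv


def parse_backslash_quoted_row(line):
    """Rewrite backslash-quote pairs as doubled quotes, then parse as ordinary CSV."""
    normalized = line.replace('\\"', '""')
    return next(csv.reader([normalized]))


row = parse_backslash_quoted_row(
    r'"plot count scouts with [task-string = \"watching-dance\"]"'
)
print(row)
```

One caveat with this sketch: a cell that ends in a literal backslash right before its closing quote would also be rewritten, so it only suits input where every backslash-quote pair is meant as an escaped quote.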

dougedmunds commented 9 years ago

Hello Bryan,

I just posted about a tool I wrote on StackOverflow.com that shows the code behind widgets. See http://stackoverflow.com/questions/32410338/how-can-i-see-the-code-behind-all-the-widgets-on-a-netlogo-interface

I made a note in the readme file about the csv issue. As far as I'm concerned, moving the code out of the widget is the best solution for now.

You can see where the embedded double quotes show up if you look at some of the models in the models library that have plots. For example, in "Disease Solo.nlogo" the Number Sick plot has this code inside it: "create-temporary-plot-pen word \"run \" run-number\nset-plot-pen-color item (run-number mod 5)\n [blue red green orange violet]" "plot num-sick"

That is actually code for two parts of the interface, plot setup and plot update. That's one of the two areas where I use csv:from-row to separate the two strings.

The other area is in Pens, where NetLogo uses one line to store 7 values (strings and numbers).

I found using csv was way easier than trying to otherwise separate the parts of the line. If you look at the code in nivi.nlogo, you'll see that the while loop in "parse-file" simply reads the file line by line using "set mydata file-read-line". file-read-line generates a string, which is fed to csv:from-row in those two places.

If you can think of a way to break the strings into parts without using csv:from-row and avoid the runtime error, let me know.

-- Doug Edmunds (in reply to Bryan Head's message of 9/4/2015, 9:38 PM)

qiemem commented 9 years ago

Hi Doug,

You can get pretty far using file-read (http://ccl.northwestern.edu/netlogo/docs/dictionary.html#file-read) to parse the plot and pen lines in question. You should be able to parse line-by-line like you're doing, except when you know you're about to hit a plot setup/update line or pen line. At that point, you invoke file-read enough times to read in every entry in the line, and then switch back to reading line-by-line.

However, this fails when dealing with multiple pens as you don't know how many pens there are, so you don't know when to switch back to file-read-line. There are a couple of solutions I can think of, of varying levels of dirtiness. Probably the best is to just read the entire file in line-by-line to count the number of pens in each plot, and then read it in again to actually import the information. Dirty, but you don't have to write your own parser.

I'll keep thinking about it though. Sorry there isn't a simpler solution.

dougedmunds commented 9 years ago

My dirty solution to the how-many-pens problem is that there is a blank line after the last pen. If my loop finds "PENS", it loops through the pens until it gets to the blank line.

(In reply to Bryan Head's message of 9/7/2015, 11:43 AM.)

qiemem commented 9 years ago

That works when reading everything with file-read-line, but doesn't work with file-read since it skips over newlines.
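For the line-by-line case, the blank-line approach might look like the following Python sketch. It assumes the layout described above: a line containing only PENS, then one line per pen, then a blank line. The function name `collect_pens` and the sample lines are hypothetical placeholders, not real .nlogo content.

```python
def collect_pens(lines):
    """Group pen lines: each group starts after a PENS line and ends at a blank line."""
    groups = []
    line_iter = iter(lines)
    for line in line_iter:
        if line.strip() == "PENS":
            pens = []
            # Keep consuming from the same iterator until the blank terminator.
            for pen_line in line_iter:
                if pen_line.strip() == "":
                    break
                pens.append(pen_line)
            groups.append(pens)
    return groups


sample = [
    "PLOT",
    "PENS",
    "hypothetical pen line 1",
    "hypothetical pen line 2",
    "",
    "PLOT",
    "PENS",
    "hypothetical pen line 3",
    "",
]
print(collect_pens(sample))
```

Because each line arrives as a separate string, the blank terminator is visible here, which is exactly what a token-based reader that skips newlines would lose.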

dougedmunds commented 9 years ago

I developed some code similar to your suggestion of using the string extension's rex-replace-all procedure. I want the model to work 'straight out of the box', without requiring any extensions not already included with NetLogo 5.2.

After running file-read-line, it now looks for the backslash-doublequote sequence (\") in the string. If found, it substitutes "@@" for it. Then it runs csv:from-row. Finally, it substitutes the backslash-doublequote back for any @@. To cover the bases, it first tests for @@ in the original string; if found, it just reports the original string without using csv:from-row.

This avoids the runtime error problem, afaik.
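The placeholder steps described above can be sketched in Python as follows. This is an illustration under assumptions, not the code in nivi.nlogo: Python's csv module stands in for csv:from-row, the name `split_row` is hypothetical, and the fallback reports the untouched string as a single-item list.

```python
import csv

PLACEHOLDER = "@@"
ESCAPED_QUOTE = '\\"'  # a backslash followed by a double quote


def split_row(line, delimiter=","):
    """Split a line into cells while shielding backslash-quote pairs from the parser."""
    if PLACEHOLDER in line:
        # The placeholder already occurs in the data, so substitution is not safe.
        return [line]
    masked = line.replace(ESCAPED_QUOTE, PLACEHOLDER)
    cells = next(csv.reader([masked], delimiter=delimiter))
    # Restore the original backslash-quote pairs in every cell.
    return [cell.replace(PLACEHOLDER, ESCAPED_QUOTE) for cell in cells]


setup_and_update = r'"create-temporary-plot-pen word \"run \" run-number" "plot num-sick"'
print(split_row(setup_and_update, delimiter=" "))
```

The restored cells still contain the backslash-quote pairs exactly as they appeared in the file, so any unescaping is left to the caller.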