GPSBabel / gpsbabel

GPSBabel: convert, manipulate, and transfer data from GPS programs or GPS receivers. Open Source and supported on MacOS, Windows, Linux, and more. Pointy clicky GUI or a command line version...
https://www.gpsbabel.org
GNU General Public License v2.0
475 stars 126 forks source link

When input cvs fields contain `; `, gpsbabel throws the "unicsv" Invalid latitude error #1286

Closed noyannus closed 3 months ago

noyannus commented 3 months ago

When input cvs fields contain ;, gpsbabel throws the "unicsv" Invalid latitude 1e+25 in waypoint error on the first data line (tried on various lines).

It works as expected with a field structure (header and data) of separated fields like so: "lat","lon","name","foo","bar","baz","blah","boom", "...", "..."

With some fields (foo and several after it) beforehand concatenated and separated with or with #, all is well also. This works: "lat","lon","name","foo – bar – baz – blah – boom – ...", "..." "lat","lon","name","foo# bar# baz# blah# boom# ...","..."

But if the separator of these already merged fields in the input are separated with a semicolon, the latitude (incidentally the first data field to be processed, maybe that means something?) is Invalid. "lat","lon","name","foo; bar; baz; blah; boom; ...","..." causes the error.

gpsbabel is called with "${<path/to/gpsbabel>}" -i unicsv -f "${TEMPFILE}" -o gpx -F "${OUTFILE}".

This is different from #591, as no fields option is involved.

Observed with GPSBabel Version 1.9.0 on openSUSE Tumbleweed 20240624, with Kernel 6.9.5-1-default (64-bit).

tsteven4 commented 3 months ago

The documentation says

If the first line contains any tabs, the data lines are assumed to be tab separated. Otherwise the fields are assumed to be separated by commas

What the code actually does is: If the first line contains a tab, the fields are assumed to be tab separated. Otherwise, if the first line contains a semicolon, the fields are assumed to be separated by semicolons.. Otherwise, if the first line contains a vertical bar, then the fields are assumed to be separated by vertical bars. Otherwise a comma separator is assumed.

Note that the header line and the data lines are all assumed to use one common separator to separate the fields, either a comma, tab, semicolon or vertical bar.

  unicsv_fieldsep = ",";
  if (header.contains('\t')) {
    unicsv_fieldsep = "\t";
  } else if (header.contains(';')) {
    unicsv_fieldsep = ";";
  } else if (header.contains('|')) {
    unicsv_fieldsep = "|";
  }
noyannus commented 3 months ago

Thank you for the quick reply and the explanation.

I think, all you stated as required is the case here. No tabs anywhere, header and data with consistent field separators, and the also possibly assumed comma separator is explicitly there.

The problem arises when the semicolon is inside a quoted and comma-separated data and/or header field. Because it is not the separator between the fields gpsbabel sees, but between content chunks of such fields, I surmise that the root cause lies not in the selection of the required separator or a separator inconsistency.

Below are anoymized first lines of the respective input files I tried. The complete first line is header labels.

all chunks as individual fields:
"lat","lon","name","xxxxxxx","xxxxxx","xxxx","xxxxxxxx","xxxxxx","xxxxxxx","xxxx","xxxxxxx","xxxxxx","xxxxxx","xxxxxxx_xxxx_0","xxxxxxx_xxxx_0"
"00.00000000","-0.000000000000000","Xxxxxxx","Xxxxxxxx","Xxxxxx","Xxxxxx Xxxxxxx","Xxxxxxx","Xxxxxxxx, XXX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxx","Xxxxxxxx","Xxxxxx","Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxx)","XX0 0XX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxxxxxx xxx Xxxxx","Xxxxxxxxx","Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxxxxx xxx Xxxxx)","XXX, Xxxxxx Xxxxxxx"
=> OK

chunks separated by " * "
"lat","lon","name","xxxxxxx * xxxxxx * xxxx * xxxxxxxx * xxxxxx * xxxxxxx * xxxx * xxxxxxx * xxxxxx * xxxxxx","xxxxxxx_xxxx_0 * xxxxxxx_xxxx_0"
"00.00000000","-0.000000000000000","Xxxxxxx","Xxxxxxxx * Xxxxxx * Xxxxxx Xxxxxxx","Xxxxxxx * Xxxxxxxx, XXX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxx * Xxxxxxxx * Xxxxxx * Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxx) * XX0 0XX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxxxxxx xxx Xxxxx * Xxxxxxxxx * Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxxxxx xxx Xxxxx) * XXX, Xxxxxx Xxxxxxx"
=> OK

ditto by " – "
"lat","lon","name","xxxxxxx – xxxxxx – xxxx – xxxxxxxx – xxxxxx – xxxxxxx – xxxx – xxxxxxx – xxxxxx – xxxxxx","xxxxxxx_xxxx_0 – xxxxxxx_xxxx_0"
"00.00000000","-0.000000000000000","Xxxxxxx","Xxxxxxxx – Xxxxxx – Xxxxxx Xxxxxxx","Xxxxxxx – Xxxxxxxx, XXX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxx – Xxxxxxxx – Xxxxxx – Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxx) – XX0 0XX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxxxxxx xxx Xxxxx – Xxxxxxxxx – Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxxxxx xxx Xxxxx) – XXX, Xxxxxx Xxxxxxx"
=> OK

separated by "; "
"lat","lon","name","xxxxxxx; xxxxxx; xxxx; xxxxxxxx; xxxxxx; xxxxxxx; xxxx; xxxxxxx; xxxxxx; xxxxxx","xxxxxxx_xxxx_0; xxxxxxx_xxxx_0"
"00.00000000","-0.000000000000000","Xxxxxxx","Xxxxxxxx; Xxxxxx; Xxxxxx Xxxxxxx","Xxxxxxx; Xxxxxxxx, XXX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxx; Xxxxxxxx; Xxxxxx; Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxx); XX0 0XX, Xxxxxx Xxxxxxx"
"00.0000000","-0.0000000","Xxxxxxx","Xxxxxxxxxxx xxx Xxxxx; Xxxxxxxxx; Xxxxxx Xxxxxxx","Xxxxxxx (Xxxxxxxxxxx xxx Xxxxx); XXX, Xxxxxx Xxxxxxx"
=> error

Of course, I could use a very obscure letter that nobody ever uses as temporary chunk separator, and run a tr over the output gpx file after the conversion. So the issue is not too serious.

tsteven4 commented 3 months ago

As I outlined above for your last case we detect a semicolon as the field separator. But, we dequote the line before trying to break it into fields. Since the semicolons are inside quotes they are treated as if they are escaped, i.e. they aren't used as delimiters.

We really should have recognized that the semicolons(or tabs, vertical lines, commas) were escaped when detecting the delimiter, but we don't.

As you suggest in your workaround you need to use something besides tab, semicolon, vertical line and comma as a temporary field separator.

tsteven4 commented 3 months ago

This works too (; as field separator, , as temporary field separator): lat;lon;name;foo, bar, baz, blah, boom, ...;... 1.0;2.0;this;that

noyannus commented 3 months ago

Yes, that could be the solution: The delimiter detection must not look into quoted fields.

Ensuring that would also allow unusual characters in text fields, e.g., a waypoint description could contain several lines (good for adresses).

tsteven4 commented 3 months ago

@noyannus can you test #1287? If necessary I can supply a snap that should run on tumbleweed.

noyannus commented 3 months ago

A snap would be better than me building the binary: you have total control over the production of the binary, and I'm somewhat busy this weekend...

tsteven4 commented 3 months ago

There is an unsigned test snap produced by CI at https://file.io/Bh7l85kMnDTR sudo snap install --dangerous gpsbabel_LinuxInstaller-20240628-f2ec5be_amd64.snap

tsteven4 commented 3 months ago

The snap has access to your home directory automatically. The are other manual connections available that you likely won't need to set up.

$ snap connections gpsbabel
Interface        Plug                      Slot   Notes
home             gpsbabel:home             :home  -
raw-usb          gpsbabel:raw-usb          -      -
removable-media  gpsbabel:removable-media  -      -
serial-port      gpsbabel:serial-port      -      -
noyannus commented 3 months ago

There is an unsigned test snap ...

Is the link correct? The page says "The transfer you requested has been deleted."

tsteven4 commented 3 months ago

Weird. It is suppose to be there for 2 weeks. I copied it up again https://file.io/7lZGc6tTlHOW

noyannus commented 3 months ago

Snap install went fine. Does gpsbabel now shadow the installed version from the gpsbabel.org/download.html installer $ which gpsbabel -> /bin/gpsbabel)? The version when I call it is the same: gpsbabel -V -> GPSBabel Version 1.9.0 / snap run gpsbabel -V -> GPSBabel Version 1.9.0 Asking because a script would have to be modified to make sure the right version gets tested

tsteven4 commented 3 months ago

We only roll the version number on release, so pre-release test versions like the snap you got have a misleading version number. However, there is some distinguishing information available (the Repository SHA and the Date). We have never officially released a snap. The snap you have is command line only (no GUI). I have built GUI snaps but they are much more problematic with all the graphics and webengine tentacles. Feedback on the desirability of a snap distribution is encouraged.

The snap version should be executable through /snap/bin/gpsbabel. Also "snap list gpsbabel" or "snap info gpsbabel" ties you back to the installer name which is based on the SHA and date.

 gpsbabel -D1
GPSBabel Version: 1.9.0
main: Repository SHA: f2ec5be
main: Date: 2024-06-28T23:09:44Z
main: Compiled with Qt 6.2.4 for architecture x86_64-little_endian-lp64
main: Running with Qt 6.2.4 on Ubuntu Core 22, x86_64
main: QLocale::system() is C
main: QLocale() is C
main: QTextCodec::codecForLocale() is UTF-8, mib 106
Nothing to do!  Use '/snap/gpsbabel/x2/usr/bin/gpsbabel -h' for command-line options.
noyannus commented 3 months ago

Works great. A typical input field containing ; (separating previously merged fields) produces:

<desc>Baker Street︂; Southbury︂; London Borough of Enfield︂; London︂; United Kingdom</desc>

\ts within fields are also preserved. Not that QMapShack or other apps could do much with them... :-)

A \n inside a field makes it hang with error: CSV_UTIL: Warning- Unbalanced Field Enclosures "\"" on line 2 or an Invalid Latitude error, but that's to be expected. Newlines would require a special treatment throughout the processing.

I think the origial issue is resolved perfectly. :+1: Thank you!

tsteven4 commented 3 months ago

You are correct we don't handle newlines in fields. The provision in RFC4180 (2.6) for enclosing CRLF sequences in fields is not implemented, and it is not likely we will ever implement it.

noyannus commented 3 months ago

Well, QMapShack does not honor newlines in imported waypoints, so for me there is no use in having them.[*]

As a general idea for a time of boredom :-), you could use a temporary placeholder for \n while gpsbabel does its magic, and after completion change it back to \n. It could be any character exotic enough to be unlikely to ever show up in a waypoint.

To even eliminate this risk, the script I call gpsbabel with uses 0xFFFC, the replacement character, rendered or in most fonts as space. Commands may want it in different representations, e.g., awk accepts it as \357\277\274 (...printf "%s\357\277\274%s\"...) while sed in the same script wants it literally (...sed -E -e "s//$VARIABLE/g"...). Or I made a dumb error here. Python may want something different; maybe one of these:

UTF-8: 0xEF 0xBF 0xBC
UTF-16: 0xFFFC
C octal escaped UTF-8: \357\277\274
XML decimal entity: &#65532;

[*] For applications that support newlines, it makes good looks. Below, the comment is a nicely line-broken street address: grafik