Open GoogleCodeExporter opened 8 years ago
The import and export sides sound like two independent problems, but I'll have
a look at both.
Original comment by tfmorris
on 10 Jun 2011 at 10:30
David told me on the mailing-list that the export problem is actually not a
problem but a feature of CSV/TSV files. See
http://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules.
Original comment by bad...@gmail.com
on 10 Jun 2011 at 11:45
Reassigning to Iain for review since he did the original "ignore quotes"
implementation. He should be more familiar with the intended behavior.
We're apparently running a private patched version of OpenCSV 2.2. OpenCSV 2.3
has been released since then, but didn't include the patch.
Original comment by tfmorris
on 11 Jun 2011 at 10:31
Ignore quotes affects only how the parser treats the separator character; it
doesn't stop the parser from chomping the quotation marks.
E.g., for the line:
hello", world"
With Ignore Quotation Marks set to false the parser would return one token:
hello, world
With Ignore Quotation Marks set to true the parser would return two tokens:
hello
world
In both cases the quotation marks are chomped, which is the correct behaviour.
The preservation of quotation marks is a separate feature.
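A rough sketch of the behaviour described above, in Python rather than Java so that it runs on its own. This is not OpenCSV's code: `tokenize` and its `ignore_quotes` parameter are hypothetical names used only for illustration, and the sketch deliberately skips escaped (doubled) quotes and whitespace trimming.

```python
def tokenize(line, separator=",", quotechar='"', ignore_quotes=False):
    """Split one line into tokens, always dropping quote characters.

    Hypothetical illustration only: with ignore_quotes=False a separator
    inside quotes is kept as data; with ignore_quotes=True every separator
    splits the line. Quote characters are removed in both modes.
    """
    tokens, current, in_quotes = [], [], False
    for ch in line:
        if ch == quotechar:
            # Toggle the quoted state; the quote itself is never emitted.
            in_quotes = not in_quotes
        elif ch == separator and (ignore_quotes or not in_quotes):
            # Separator ends the current token.
            tokens.append("".join(current))
            current = []
        else:
            current.append(ch)
    tokens.append("".join(current))
    return tokens


line = 'hello", world"'
print(tokenize(line, ignore_quotes=False))  # ['hello, world']
print(tokenize(line, ignore_quotes=True))   # ['hello', ' world']
```

Note that this sketch keeps the leading space of the second token; whether a real parser trims it is a separate setting and is not addressed here.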
I'm not sure if OpenCSV has a way to preserve quotation marks. (I'll take a
look). If so (or it's something I can add easily), would the preservation of
quotations be a feature that we would like to see on the importer page?
With regard to our private patched version of OpenCSV 2.2: the patch is now in
the OpenCSV trunk and will be included in version 2.4 (I'm not sure of the
release date of that, though).
https://sourceforge.net/tracker/?func=detail&aid=3018599&group_id=148905&atid=773543
Until OpenCSV 2.4 is released, I think it may be better practice to build
OpenCSV from its current trunk (it will have all the improvements in the 2.3
release as well; what those are I'm not sure, as I can't find a changelog!)
and use that, rather than using our branched 2.2. Any objections to doing this?
Original comment by iainsproat
on 13 Jun 2011 at 2:21
> would the preservation of quotations be a feature that we would like to see
> on the importer page?
I think it'd be better if whatever is generating the files to begin with were
fixed to conform to the CSV "standard" (pick one). That is, wrap fields in
quotes and double up quotes in the data.
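As a small illustration of "wrap fields in quotes and double up quotes in the data", here is a sketch using Python's standard `csv` module. The sample rows are invented, and Python is used only as a convenient example; the tool that actually generates the files could be anything.

```python
import csv
import io

# Made-up sample data: the second field contains both quotes and a comma.
rows = [["id", "comment"], ["1", 'She said "hi", then left']]

# QUOTE_ALL wraps every field in quotes; embedded quotes are doubled
# because the writer's doublequote option defaults to True.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text)
# "id","comment"
# "1","She said ""hi"", then left"

# Reading the text back with a standard CSV reader recovers the original fields.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed == rows)  # True
```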
Original comment by paulm%pa...@gtempaccount.com
on 13 Jun 2011 at 2:31
The more I learn about this, the less I'm inclined to continue down this
convoluted path of special casing things. I think we'd be better off just
adding better documentation about what "ignore quotes" does and pointing people
at documentation on how to create well-formed CSV files.
As for using the OpenCSV trunk, that seems risky to me because there's no
telling how stable they keep their trunk. We could reapply Iain's patch to
2.3, but unless 2.3 has bug fixes we need, that may be more trouble than it's
worth. (Whatever path we choose, we should make sure opencsv-sources.jar
matches what we use. I got very confused while debugging when it didn't contain
the constructor we were calling.)
Original comment by tfmorris
on 13 Jun 2011 at 4:40
> As for using the OpenCSV trunk, that seems risky to me because there's no
> telling how stable they keep their trunk.
Thinking about it, the patched "2.2" version was (if I recollect correctly)
built off of the trunk HEAD revision at the time (June 2010)....
Original comment by iainsproat
on 13 Jun 2011 at 5:57
My opinion on this issue is that the import mechanism of Refine should be made
similar to the export mechanism, i.e. allow importing by format rather than via
what appears to be a self-written splitting method (which I understand it isn't).
Using formats allows us to point to the documentation of [CTP]SV and require the
import files to be compatible with the standard.
Original comment by bad...@gmail.com
on 14 Jun 2011 at 2:40
> require the import files to be compatible with the standard.
I think we should expect import files to be partly non-compliant with
standards, as part of the definition of "messy data". (Obviously the XML or
JSON parsers will be more likely to choke on non-compliant input than the CSV
parser.)
There are some very large changes on the way with the importer UI, which I hope
will help greatly with importing data. I like the idea of having small bits of
inline documentation though, perhaps as a tooltip. I'm not totally sure what
David has in his revised importer UI though. (David?)
Original comment by iainsproat
on 14 Jun 2011 at 2:53
Refine is a clean-up tool, built for cleaning even non-standard formats. That
includes CSV and its variations outside of the pseudo-standard
http://tools.ietf.org/html/rfc4180. There is strict RFC 4180 and non-strict, and
Refine allows both, and indeed any text format, to be dealt with. In my opinion,
quoted strings and separated data fields are actually handled quite well now.
Sometimes the separation and cleanup can be handled after the import. But I do
agree with Paul that we should adhere to whatever standard is agreed upon for
handling double quotes, single quotes, or what some programs just call
"text-qualified" fields, as noted in all the variations of CSV formatting here:
http://www.csvreader.com/csv_format.php
Pick the method for handling quotes, stick to it throughout, and document the
hell out of it so users are not confused.
Original comment by thadguidry
on 14 Jun 2011 at 3:11
Hi Iain, I've been mostly working on the plumbing and not so much on the
details like tooltips. Perhaps you could check out the branch new-importer-ui
and see what hints are missing? From chatting offline with Thad, he and I think
we have all the importer levers now for at least the formats TSV/CSV/*SV, JSON,
XML.
I'm attaching some screenshots to show the development so far.
Original comment by dfhu...@gmail.com
on 15 Jun 2011 at 5:42
Attachments:
Original comment by tfmorris
on 18 Sep 2012 at 5:49
Original comment by tfmorris
on 18 Sep 2012 at 5:52
Original issue reported on code.google.com by
bad...@gmail.com
on 10 Jun 2011 at 9:34
Attachments: