Closed GoogleCodeExporter closed 9 years ago
This is a core function that is holding up the work of the ALA4R project so I'm
having a go at upping the priority :)
Original comment by leebel...@gmail.com
on 21 Mar 2014 at 2:07
theres a few issues listed here. Which are actually holding things up ? Most of
the issues seem minor enhancements to improve performance of the R package on
first glance.
Original comment by moyesyside
on 21 Mar 2014 at 2:41
The principal problem is that quotes aren't guaranteed to be balanced
which means the files can't be guaranteed to be parsed properly. Minimum
solution = make sure all quotes are escaped.
Everything else is indeed "enhancement" and while these might look/sound
minor, they are having a big impact on performance in R.
Backslash-escaping quotes, avoiding quotes where not needed (e.g.
numeric values don't need to be quoted), and stripping line breaks from
inside text fields will make R life considerably more pleasant!
Original comment by antarcti...@gmail.com
on 21 Mar 2014 at 5:21
[deleted comment]
I have implemented the following fixes:
1) The citation.csv is now valid CSV
2) Support for extra params to the download services sep and esc which will
specify the CSV field separator character to use and the CSV escape character
to use
For example the following URL will produce a tab separated file that uses the \
character to escape:
http://biocache-test.ala.org.au/ws/occurrences/index/download?q=data_resource_ui
d:dr364&reasonTypeId=10&sep=%09&esc=\
The sep and esc setting are obeyed in the data.csv and citation.csv file.
Hope that this solution will suit the needs.
Original comment by natasha....@csiro.au
on 3 Apr 2014 at 4:12
This in on prod. Marking as fixed.
Original comment by moyesyside
on 17 Apr 2014 at 4:57
Apparently not quite fixed. This in from Jeremy today
I spent some 90min trying to get read.csv (and alternatives) to bring in the
citation file (with \r\n embedded within quotes) and this was a fail... we will
have to ask ALA to remove prior to send it out.
Original comment by leebel...@gmail.com
on 17 Apr 2014 at 7:56
Assuming the service is providing valid CSV but this is a problem for R, then
perhaps try some of the citation services for resources:
http://collections.ala.org.au/ws/citations/dr359.json
http://collections.ala.org.au/ws/citations/dr359
Original comment by moyesyside
on 17 Apr 2014 at 12:21
please try and open e.g., citation.csv that contains \r\n embedded within
quotes in something like excel; it fails there too demonstrating R is not the
issue.
Original comment by JJVanDerWal
on 17 Apr 2014 at 12:36
to follow my last comment... the citation.csv that I am looking at fails at UID
of dr668.
Original comment by JJVanDerWal
on 17 Apr 2014 at 12:45
Attachments:
Our CSV readers have no problems with reading the attached citation.csv.
Not sure what we should do...
Original comment by natasha....@csiro.au
on 22 Apr 2014 at 12:08
Is it particularly difficult (or slow) to strip embedded newlines on the
fly? If not, perhaps we can add a "stripnewlines" parameter to the
service (which would presumably be off by default, to maintain
backwards-compatibility).
Do we know if the newlines definitely only appear in the citations.csv
file (not the actual data.csv file)? I've only noticed them in the
citation file, but maybe they can potentially occur in the data.csv file
as well. If it is *only* the citations.csv file, then on-the-fly newline
stripping on the server side shouldn't be all that slow.
If it is utterly impossible to do it on the server side, it might be
possible to do it client-side for the citations.csv file (but it'll
require unzipping the file, reading citations, modifying it, re-writing
it, and re-reading it, so it won't be fast).
If the data.csv file can also potentially contain newlines within
fields, then that could be quite problematic because the files are so
much larger.
(By the way, I totally acknowledge that this is exacerbated by R's
somewhat-idiosyncratic handling of CSV's ... but that's all we've got to
work with!)
Original comment by antarcti...@gmail.com
on 22 Apr 2014 at 2:04
what csv readers are you using?
R is not the only program that has "somewhat-idiosyncratic handling of CSV's";
I have tested R, excel, open office and google docs -- all fail on the
citation.csv I uploaded the other day. See excel screenshot attached.
following on Ben's queries...
1. are there any newlines embedded in the data file (not just the citation.csv)?
2. can these be stripped server side?
Original comment by JJVanDerWal
on 22 Apr 2014 at 8:52
Attachments:
We are using the Java open CSV reader (http://opencsv.sourceforge.net/) which
allows for embedded newline characters with different escape characters to
quotes.
Excel requires embedded quote characters to be quoted. The attached file I
have changed the esc character to be a ". It opens correctly in Excel. Can
this file be correctly read by R?
The escape character can be specified by providing an esc param:
http://biocache.ala.org.au/ws/occurrences/index/download?q=data_resource_uid:dr6
68&esc="&reasonTypeId=10
What are the restrictions on the R CSV reader?
Also to answer your questions:
1) There could be embedded new line characters in some of the raw data that we
allow to be downloaded.
2) Stripping them at the server would require a change. It would be nice to be
able to get the citations.csv into a format that the R CSV reader can use.
Original comment by natasha....@csiro.au
on 22 Apr 2014 at 10:50
Attachments:
thanks Natasha, I can confirm that now reads correctly in R.
I am still concerned about new line characters in the raw data... but this may
deal with it???
Thanks again for your help with this issue.
Original comment by JJVanDerWal
on 23 Apr 2014 at 1:45
Closing this issue. Source data with newline characters is something that may
need to be addressed in the future. If this is the case a new issue should be
opened.
Original comment by natasha....@csiro.au
on 23 Apr 2014 at 1:54
Another thing to note if you don't provide an esc value it will default to an
Excel readable form. Default values for params:
esc="
sep=,
You only need to provide these params if you need to override them.
Original comment by natasha....@csiro.au
on 23 Apr 2014 at 2:03
Original issue reported on code.google.com by
antarcti...@gmail.com
on 24 Feb 2014 at 5:49