bsed / ala

Automatically exported from code.google.com/p/ala
0 stars 0 forks source link

Issues with occurrence download csv data #595

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
http://biocache.ala.org.au/ws/occurrences/download

- citation.csv sometimes has unbalanced quotes which is preventing successful 
parsing of the file (e.g. the "Paul Daves "Littlewood" in  "dr356","Littlewood 
Species List","Paul Daves "Littlewood","Creative Commons 
Attribution-NonCommercial 2.5 Australia (CC BY-NC)","For more information: 
http://collections.ala.org.au/public/show/dr356","","","","1"

- data.csv has embedded quotes that seem unneccessary (e.g. in the "Taxon 
identification issue" column has entries like "[""noIssue""]", quoted person 
names, quoted placenames, etc). 

- suggest that embedded quotes are avoided unless absolutely necessary, and 
where needed, can they be backslash-escaped (i.e. \") rather than double-quotes 
(""). This might seem really minor but makes a big difference to parsing the 
data in R.

- citation.csv and occasionally data.csv has line breaks embedded in the 
fields. Allowing this in parsing slows life down by a lot. Are they necessary?

- everything in data.csv and citation.csv is quoted, even numeric values. This 
is unneccesary and a bit inefficient, but more importantly may be exacerbating 
the above issues

(As a comparison, a relatively-small 67000 row test set parses in 0.5s with no 
line breaks and backslash-escaped quotes. With "" quotes and line breaks it 
takes 5.3s)

Original issue reported on code.google.com by antarcti...@gmail.com on 24 Feb 2014 at 5:49

GoogleCodeExporter commented 9 years ago
This is a core function that is holding up the work of the ALA4R project so I'm 
having a go at upping the priority :)

Original comment by leebel...@gmail.com on 21 Mar 2014 at 2:07

GoogleCodeExporter commented 9 years ago
theres a few issues listed here. Which are actually holding things up ? Most of 
the issues seem minor enhancements to improve performance of the R package on 
first glance.

Original comment by moyesyside on 21 Mar 2014 at 2:41

GoogleCodeExporter commented 9 years ago
The principal problem is that quotes aren't guaranteed to be balanced 
which means the files can't be guaranteed to be parsed properly. Minimum 
solution = make sure all quotes are escaped.

Everything else is indeed "enhancement" and while these might look/sound 
minor, they are having a big impact on performance in R. 
Backslash-escaping quotes, avoiding quotes where not needed (e.g. 
numeric values don't need to be quoted), and stripping line breaks from 
inside text fields will make R life considerably more pleasant!

Original comment by antarcti...@gmail.com on 21 Mar 2014 at 5:21

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I have implemented the following fixes:

1) The citation.csv is now valid CSV
2) Support for extra params to the download services sep and esc which will 
specify the CSV field separator character to use and the CSV escape character 
to use

For example the following URL will produce a tab separated file that uses the \ 
character to escape:
http://biocache-test.ala.org.au/ws/occurrences/index/download?q=data_resource_ui
d:dr364&reasonTypeId=10&sep=%09&esc=\

The sep and esc setting are obeyed in the data.csv and citation.csv file.

Hope that this solution will suit the needs.

Original comment by natasha....@csiro.au on 3 Apr 2014 at 4:12

GoogleCodeExporter commented 9 years ago
This in on prod. Marking as fixed.

Original comment by moyesyside on 17 Apr 2014 at 4:57

GoogleCodeExporter commented 9 years ago
Apparently not quite fixed. This in from Jeremy today

I spent some 90min trying to get read.csv (and alternatives) to bring in the 
citation file (with \r\n embedded within quotes) and this was a fail... we will 
have to ask ALA to remove prior to send it out.

Original comment by leebel...@gmail.com on 17 Apr 2014 at 7:56

GoogleCodeExporter commented 9 years ago
Assuming the service is providing valid CSV but this is a problem for R, then 
perhaps try some of the citation services for resources:

http://collections.ala.org.au/ws/citations/dr359.json
http://collections.ala.org.au/ws/citations/dr359

Original comment by moyesyside on 17 Apr 2014 at 12:21

GoogleCodeExporter commented 9 years ago
please try and open e.g., citation.csv that contains \r\n embedded within 
quotes in something like excel; it fails there too demonstrating R is not the 
issue. 

Original comment by JJVanDerWal on 17 Apr 2014 at 12:36

GoogleCodeExporter commented 9 years ago
to follow my last comment... the citation.csv that I am looking at fails at UID 
of dr668.

Original comment by JJVanDerWal on 17 Apr 2014 at 12:45

Attachments:

GoogleCodeExporter commented 9 years ago
Our CSV readers have no problems with reading the attached citation.csv.

Not sure what we should do...

Original comment by natasha....@csiro.au on 22 Apr 2014 at 12:08

GoogleCodeExporter commented 9 years ago
Is it particularly difficult (or slow) to strip embedded newlines on the 
fly? If not, perhaps we can add a "stripnewlines" parameter to the 
service (which would presumably be off by default, to maintain 
backwards-compatibility).

Do we know if the newlines definitely only appear in the citations.csv 
file (not the actual data.csv file)? I've only noticed them in the 
citation file, but maybe they can potentially occur in the data.csv file 
as well. If it is *only* the citations.csv file, then on-the-fly newline 
stripping on the server side shouldn't be all that slow.

If it is utterly impossible to do it on the server side, it might be 
possible to do it client-side for the citations.csv file (but it'll 
require unzipping the file, reading citations, modifying it, re-writing 
it, and re-reading it, so it won't be fast).

If the data.csv file can also potentially contain newlines within 
fields, then that could be quite problematic because the files are so 
much larger.

(By the way, I totally acknowledge that this is exacerbated by R's 
somewhat-idiosyncratic handling of CSV's ... but that's all we've got to 
work with!)

Original comment by antarcti...@gmail.com on 22 Apr 2014 at 2:04

GoogleCodeExporter commented 9 years ago
what csv readers are you using? 
R is not the only program that has "somewhat-idiosyncratic handling of CSV's"; 
I have tested R, excel, open office and google docs -- all fail on the 
citation.csv I uploaded the other day. See excel screenshot attached. 

following on Ben's queries...
1. are there any newlines embedded in the data file (not just the citation.csv)?
2. can these be stripped server side?

Original comment by JJVanDerWal on 22 Apr 2014 at 8:52

Attachments:

GoogleCodeExporter commented 9 years ago
We are using the Java open CSV reader (http://opencsv.sourceforge.net/) which 
allows for embedded newline characters with different escape characters to 
quotes.

Excel requires embedded quote characters to be quoted.  The attached file I 
have changed the esc character to be a ". It opens correctly in Excel.  Can 
this file be correctly read by R?

The escape character can be specified by providing an esc param:
http://biocache.ala.org.au/ws/occurrences/index/download?q=data_resource_uid:dr6
68&esc="&reasonTypeId=10

What are the restrictions on the R CSV reader?

Also to answer your questions:
1) There could be embedded new line characters in some of the raw data that we 
allow to be downloaded. 

2) Stripping them at the server would require a change. It would be nice to be 
able to get the citations.csv into a format that the R CSV reader can use.

Original comment by natasha....@csiro.au on 22 Apr 2014 at 10:50

Attachments:

GoogleCodeExporter commented 9 years ago
thanks Natasha, I can confirm that now reads correctly in R.

I am still concerned about new line characters in the raw data... but this may 
deal with it???

Thanks again for your help with this issue.

Original comment by JJVanDerWal on 23 Apr 2014 at 1:45

GoogleCodeExporter commented 9 years ago
Closing this issue. Source data with newline characters is something that may 
need to be addressed in the future.  If this is the case a new issue should be 
opened. 

Original comment by natasha....@csiro.au on 23 Apr 2014 at 1:54

GoogleCodeExporter commented 9 years ago
Another thing to note if you don't provide an esc value it will default to an 
Excel readable form. Default values for params:
esc="
sep=,

You only need to provide these params if you need to override them.

Original comment by natasha....@csiro.au on 23 Apr 2014 at 2:03