br1ghtyang / asterixdb

Automatically exported from code.google.com/p/asterixdb

add CSV serialization #548

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
One proposal:

  convert-to-csv($record, $options)

$record is any ADM record, $options is an options record of type

  type ConvertToCSVOptions as {
    fields : [ string ],
    fs : string?
  }

The "fields" field of $options is the field specification and the "fs" field of 
$options is the field separator.
The result of the function is a string. It contains
- the serialized values of the fields identified by the strings in the field 
specification
- in the order given by the field specification
- separated by the value of the field separator.
If $options does not contain a field separator, the values are separated by ",".
If $record does not contain a field named in the field specification, the 
value is empty: there is no character between the adjacent field separators 
in the result.
If the type of a field is not boolean, string, intXX, float, double, Date, 
Time, Datetime, or Duration, an error is raised.
If a field is a string, it is enclosed in double quotes. Double quotes inside 
the string are escaped by doubling them ("").
If a field is a Date, Time, Datetime, or Duration, it is formatted using the 
extended format of ISO 8601 (which should be the same as XML Schema).
If in doubt, follow http://tools.ietf.org/html/rfc4180.

We also should make sure that the serialization of numbers follows usual 
conventions (no unusual suffixes).
I think that we should do that in general, but if we don't we should do it for 
this function.

There is no provision for a record separator, as I think that newlines are 
always inserted during serialization.

This version seems quite restrictive, but I think it should be sufficient for 
now and it can be extended later.
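
The rules above can be sketched in a few lines. This is only an illustration in Python, not AsterixDB code: the name convert_to_csv mirrors the proposal, but the mapping of ADM types to Python types, the "true"/"false" spelling of booleans, and treating a None value like a missing field are assumptions. Duration is left out because Python's standard library has no ISO 8601 duration formatter.

```python
from datetime import date, datetime, time


def convert_to_csv(record, fields, fs=","):
    """Serialize the flat fields of `record` named in `fields`, in that order."""
    cells = []
    for name in fields:
        value = record.get(name)
        if value is None:
            # Missing field (or None, by assumption): nothing between separators.
            cells.append("")
        elif isinstance(value, bool):
            # Checked before int because bool is a subclass of int in Python.
            cells.append("true" if value else "false")
        elif isinstance(value, str):
            # Enclose in double quotes and double any embedded quote (RFC 4180).
            cells.append('"' + value.replace('"', '""') + '"')
        elif isinstance(value, (int, float)):
            # Plain numeric text, no type suffixes.
            cells.append(repr(value))
        elif isinstance(value, (datetime, date, time)):
            # isoformat() produces the ISO 8601 extended format.
            cells.append(value.isoformat())
        else:
            raise TypeError(
                "field %r has unsupported type %s" % (name, type(value).__name__)
            )
    return fs.join(cells)
```

Calling it with fields ["id", "missing", "name"] on a record that has no "missing" field leaves an empty cell in the second position, and passing fs="|" switches the separator.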

Original issue reported on code.google.com by westm...@gmail.com on 3 Jul 2013 at 12:38

GoogleCodeExporter commented 8 years ago
The fields in the field specification only contain simple field names - no 
paths - such that only the "flat" part of a record can be converted to CSV.

Original comment by westm...@gmail.com on 3 Jul 2013 at 6:37

GoogleCodeExporter commented 8 years ago
If the value of an existing field is null, it is treated like a non-existing 
field.

Original comment by westm...@gmail.com on 3 Jul 2013 at 6:37

GoogleCodeExporter commented 8 years ago
Instead of the function, we just decided to set a parameter:

set serialization "csv(['id','name'], ',')"

and let the framework do the serialization.
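
To make the shape of that parameter concrete, here is a rough Python sketch of how such a value could be split into a field list and a separator. It is not how AsterixDB parses the setting; the function name parse_csv_serialization and the default separator when the second argument is omitted are assumptions.

```python
import ast
import re


def parse_csv_serialization(setting):
    """Split a value like "csv(['id','name'], ',')" into (fields, separator)."""
    match = re.fullmatch(r"\s*csv\((.*)\)\s*", setting, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a csv(...) serialization setting: %r" % setting)
    # Wrap the arguments as a tuple literal so literal_eval can read them
    # without evaluating arbitrary code.
    args = ast.literal_eval("(" + match.group(1) + ",)")
    if not 1 <= len(args) <= 2:
        raise ValueError("expected csv(fields) or csv(fields, separator)")
    fields = args[0]
    separator = args[1] if len(args) == 2 else ","  # assumed default
    if not isinstance(fields, list) or not all(isinstance(f, str) for f in fields):
        raise ValueError("fields must be a list of strings")
    if not isinstance(separator, str):
        raise ValueError("separator must be a string")
    return fields, separator
```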

Original comment by westm...@gmail.com on 25 Jul 2013 at 12:20

GoogleCodeExporter commented 8 years ago

Original comment by westm...@gmail.com on 30 Jul 2014 at 6:45

GoogleCodeExporter commented 8 years ago
I'm not clear about the relationship between the specification of serialization 
options in the query and the use of the content type in the http interface. It 
seems that we should have a design/plan here (and I'd be happy to hear that 
there already is one :) ).

Original comment by westm...@gmail.com on 30 Jul 2014 at 4:30

GoogleCodeExporter commented 8 years ago

Original comment by westm...@gmail.com on 30 Jul 2014 at 4:31

GoogleCodeExporter commented 8 years ago
We still need one!  One option could be related to having headers in CSV 
files, perhaps?
(To indicate the names?)  Design thought needed!!

Original comment by dtab...@gmail.com on 30 Jul 2014 at 5:09

GoogleCodeExporter commented 8 years ago
I agree that we need the serialization, but it's not clear to me how the user 
should tell the system and how we can manage or avoid conflicts. Assume e.g. 
that the user sets an option to serialize to CSV in the query and then somebody 
talks to the http interface and requests "application/json". Who's right, the 
query writer or the result consumer? It seems to me that the "right" way would 
be for the query writer to decide what the resulting ADM instance is and for 
the result consumer to decide what the serialization should look like. But 
maybe there are other opinions ...
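
One way to picture "the result consumer decides" is to pick the serialization from the HTTP Accept header alone. The Python sketch below is an illustration only: the function name, the table of supported media types, and the JSON default are assumptions, and q-value weighting is deliberately ignored.

```python
# Hypothetical table of media types the HTTP interface could serve.
SERIALIZERS = {
    "application/json": "json",
    "text/csv": "csv",
}


def choose_serialization(accept_header, default="application/json"):
    """Return the first supported media type listed in the Accept header."""
    for part in accept_header.split(","):
        # Drop parameters such as ";q=0.9" and normalize case.
        media_type = part.split(";", 1)[0].strip().lower()
        if media_type in SERIALIZERS:
            return media_type
    # Nothing recognized (including "*/*"): fall back to the assumed default.
    return default
```

Under this reading the query only determines the ADM result, and a consumer sending "Accept: text/csv" gets CSV regardless of what the query writer had in mind.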

Original comment by westm...@gmail.com on 30 Jul 2014 at 6:02

GoogleCodeExporter commented 8 years ago
Good point.  And there is no way in AQL/ADM land to be sufficiently 
prescriptive, anyway, since we have a set view of records in terms of 
fields and their order (or lack thereof).  +1

Original comment by dtab...@gmail.com on 30 Jul 2014 at 6:25

GoogleCodeExporter commented 8 years ago
I've implemented an output method akin to the JSON serialization. It currently 
does the basics - it can display strings, numbers, and a couple other types as 
valid CSV, and it detects and throws exceptions for things that cannot be 
represented as CSV such as list values and nested records.

The big missing link at the moment is a header; I'm not actually sure where to 
put that logic, without risking getting multiple headers if the query 
processing is spread across multiple NCs.

Also, it needs to detect the situation where the records do not follow the same 
schema. That at least I believe should be fairly straightforward.
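
The two checks mentioned here (rejecting nested values and detecting records that do not share a schema) could look roughly like the Python sketch below. It is not the implementation described in this comment; check_csv_compatible is a hypothetical name, and "same schema" is taken to mean "same set of field names", which is an assumption.

```python
def check_csv_compatible(records):
    """Return the sorted shared field names, or raise ValueError."""
    expected = None
    for index, record in enumerate(records):
        for name, value in record.items():
            # Lists and nested records have no flat CSV representation.
            if isinstance(value, (list, dict)):
                raise ValueError(
                    "record %d: field %r is nested and cannot be written as CSV"
                    % (index, name)
                )
        names = set(record)
        if expected is None:
            expected = names
        elif names != expected:
            raise ValueError(
                "record %d: fields %s differ from %s"
                % (index, sorted(names), sorted(expected))
            )
    return sorted(expected) if expected is not None else []
```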

Original comment by c...@lambda.nu on 7 Nov 2014 at 11:39

GoogleCodeExporter commented 8 years ago
Maybe formatting could be viewed/modeled kind of like aggregation 
operations?
In this case think of average - the nodes compute the "partial sums and 
counts" (unheaded CSV) and then the final node would receive those and 
compute the "final average" by cat'ing and head'ing the partial results?
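
The aggregation analogy can be sketched as follows: each node produces headerless CSV, and one final step concatenates the pieces and adds a single header. This Python fragment is only an illustration of that idea; merge_partial_csv is a hypothetical name, and writing header names unquoted is an assumption.

```python
def merge_partial_csv(header_fields, partials, fs=","):
    """Concatenate headerless CSV chunks and put one header line on top."""
    out = [fs.join(header_fields) + "\n"]
    for chunk in partials:
        if chunk:
            # Make sure every chunk ends with a newline before appending.
            out.append(chunk if chunk.endswith("\n") else chunk + "\n")
    return "".join(out)
```

Because the header is added only in the final step, the result has exactly one header line no matter how many nodes contributed partial output.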

Original comment by dtab...@gmail.com on 7 Nov 2014 at 3:07