martindrapeau closed this issue 6 years ago
Hi Martin,
I have read the text you added for the new tool - here's my preferred version for you to consider:
About the CSVJSON format (variant of CSV):
CSVJSON is a CSV-like text format where each line is a JSON array without the surrounding brackets.
For data made of numbers and 'simple' strings, CSVJSON looks just like CSV.
Parsing CSVJSON is done by processing one line at a time. Wrap a line with square brackets [] and use JSON.parse() to convert to a JSON array.
An explanation of CSVJSON and its benefits can be found at the specification website: csvjson.org
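The per-line parse described above can be sketched in a few lines of JavaScript (the function name is my own):

```javascript
// Parse a CSVJSON document: each line is a JSON array without the
// surrounding brackets, so wrap it in [] and hand it to JSON.parse().
function parseCsvJson(text) {
  return text
    .split('\n')
    .filter((line) => line.length > 0) // skip a trailing blank line
    .map((line) => JSON.parse('[' + line + ']'));
}

const doc = '1,"John","Doe"\n2,"Jane",null';
console.log(parseCsvJson(doc));
// [ [ 1, 'John', 'Doe' ], [ 2, 'Jane', null ] ]
```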
CSVJSON is ideal as a common format for dumping database tables because:
Being based on UTF-8, it can reliably maintain text from different languages.
It has a standard concept of nulls.
It can deal with modern database features like objects and arrays.
Being based on JSON, there is a large variety of high-quality formatters and parsers in virtually every programming language.
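For illustration (the field values here are my own), a single CSVJSON line can carry a null, a nested object, and an array, none of which RFC-4180 CSV can express:

```javascript
// One CSVJSON line mixing a number, a null, an object, and an array;
// wrapping it in brackets yields a valid JSON array.
const line = '1,null,{"city":"Montreal"},["a","b"]';
const fields = JSON.parse('[' + line + ']');
console.log(fields[1]);      // null
console.log(fields[2].city); // Montreal
```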
CSVJSON is more expressive than CSV (whose common use is documented by RFC-4180). As a result, there are many cases where products and libraries that can read CSV would fail to read CSVJSON due, for example, to escaping rules and embedded objects. Given CSVJSON's simplicity and utility, more tools and libraries will support it over time.
I have already added samples at csvjson.org and will also add the use case above there. Many thanks for the support.
I tried the JSON2CSV tool with the following JSON while checking the CSVJSON box:
[{ "index": "string with bell&newlines", "value1": "bell is \u0007", "value2": "multi\nline\ntext" }]
and got:
"index","value1","value2" "string with bell&newlines","bell is \u0007","multi line text"
where it should have been:
"index","value1","value2" "string with bell&newlines","bell is \u0007","multi\nline\ntext"
The way to produce CSVJSON from JSON, in my opinion, is, for every record: collect the values into an array, serialize it to a minified JSON string, and write out everything from the second character to the one before the last, followed by a newline.
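That serialization step can be sketched like this (the function name is my own; slicing off the array's surrounding brackets has the same effect as taking the 2nd character through the one before the last):

```javascript
// Emit one CSVJSON line per record: serialize the values as a JSON
// array, then strip the leading '[' and trailing ']'.
function toCsvJsonLine(values) {
  const json = JSON.stringify(values); // minified by default
  return json.slice(1, -1);
}

const row = ['multi\nline\ntext', 'bell is \u0007', null];
console.log(toCsvJsonLine(row));
// "multi\nline\ntext","bell is \u0007",null
```

Note that JSON.stringify() escapes the embedded newline and the bell character, which is exactly the behavior the bug report above is about.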
BTW, I work at Attunity where we create database replication tools. These days much of that involves big data, where CSV is very popular despite all of its 'sloppiness'. Trying to get CSV right without customers complaining about mismatched lines in multi-GB files due to embedded newlines (for example) was one of my motivations...
/d
On Thu, Apr 5, 2018 at 3:17 PM, Martin Drapeau notifications@github.com wrote:
It would be nice for you to document some of the situations where you use the CSVJSON format and provide some samples. Others could benefit from that.
Hi Dror,
I fixed the reported bug and replaced the explanation text with yours. I did keep a good quote of yours; I like the human touch it provides to the new format.
Have a look and let me know if all is good. I will then close the issue in my repo.
I would encourage you to add a paragraph in your spec about your experience at Attunity and why the format was created. Alternatively, you could do it as a blog post. In my mind, it's important to share the motivation from a more personal level.
--Martin
Hi Dror, I reference the CSVJSON format in an article I wrote. Feel free to provide me comments. https://medium.com/@martindrapeau/the-state-of-csv-and-json-d97d1486333 --Martin martindrapeau@gmail.com
Hi Martin - great article, made some comments there. I think people will find it very useful.
BTW, until database vendors adopt the CSVJSON variant of CSV, there won't be much use for parsers. I have put this idea in the open because it can really solve interoperability challenges with database export and data generated by big data processes.
One frustrating example was taking CSV-formatted exported data from Teradata to be loaded into Oracle. We invested more than a week to iron out incompatibilities between the CSV format Teradata generates and Oracle reads (some of which included replacing characters at the SQL level in the source Teradata database, something that harmed performance). Without CSVJSON this work needs to be done per pair of databases. With CSVJSON, there's just one format to rule them all: a format that looks very much like CSV and with similar storage and processing patterns.
I'll find the time to link some of this information from the main page.
Thanks,
/d