capitalone / giraffez

User-friendly Teradata client for Python
https://capitalone.github.io/giraffez
Apache License 2.0
108 stars 31 forks source link

BulkExport.to_str() writes when delimiter is present in table #45

Open theianrobertson opened 6 years ago

theianrobertson commented 6 years ago

If I have a table with a pipe (|) in a character field and try to export and write to a string, there's no checking for whether the delimiter exists in any of the fields:

import giraffez

sql = """SELECT * FROM 
(
SELECT
CAST('first|field' AS VARCHAR(20)) AS first_char,
CAST('second field' AS VARCHAR(20)) AS second_char
) AS tbl
"""

with giraffez.BulkExport(sql) as export:
    r = list(export.to_str(delimiter='|'))

print(r)
['first|field|second field']

I could bring it out with a different delimiter, or roll my own to_str that takes in a dict from to_dict/to_list, but wondering if there should be functionality pushed up into to_str to error out, warn, or remove the field if the delimiter exists in a column?

ChrisRx commented 6 years ago

Checking the delimiter for every byte of data would introduce a very significant performance change to cover this case. While I agree that it would be nice to find that the source data has a character that is being used for a delimiter, but I think for the most part this is why the to_list and to_dict exists so that in the case that this behavior is desired they can be used to produce those results with a higher level of safety (i.e. use to_list and pass to Python's csv module). Since to_str is building a buffer in C and returning that as text it would be a fairly large change to try and introduce the kind of logic that may address some of these concerns like a quote character. This was not originally in scope for this because to_str is suppose to be fast and straight-forward, and I also wish to discourage the use of character delimited formats altogether because they are not portable or safe to use in far too many cases, vs. something like JSON which fully supports unicode, is self-describing of all fields, and is portable regardless of OS/CPU Architecture.

I think it is possible to better abstract this so that it is either more apparent how to solve issues like in your example, or to just provide a better interface that would solve for this automatically. For example, to_str is still available but under the hood it just calls to_list and uses something like csv rather than providing the other option that builds the string in C. Thoughts?