aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.84k stars 678 forks

Double carriage return when using \r\n as line terminator #2853

Open davidcava opened 2 weeks ago

davidcava commented 2 weeks ago

Describe the bug

It seems the old Issue 415 is back: passing \r\n as the line terminator when calling to_csv yields a CSV file with an extra carriage return.

How to Reproduce

See Issue 415

Expected behavior

See Issue 415

Your project

No response

Screenshots

No response

OS

AWS Lambda

Python version

3.12

AWS SDK for pandas version

3.8.0

Additional context

Using layer arn:aws:lambda:eu-west-1:336392948345:layer:AWSSDKPandas-Python312-Arm64:9

LeonLuttenberger commented 2 weeks ago

Hey,

Apparently this is a limitation in Pandas, where they explicitly don't support \r\n as a line separator any more: https://github.com/pandas-dev/pandas/blob/de5d7323cf6fcdd6fcb1643a11c248440787d960/pandas/_libs/parsers.pyx#L440.

When I try to read a local file that uses \r\n as a separator using pandas I get the following error:

ValueError: Only length-1 line terminators supported

If you found a workaround for this, we can look into taking advantage of it. But right now, we can't implement something that pandas itself doesn't support.
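For reference, one way to hit the error quoted above is to pass a two-character lineterminator to read_csv. This is only a sketch, assuming a recent pandas with the default C parser; the sample data and the helper name are made up for illustration, and the exact message may differ between versions or engines.

```python
import io

import pandas as pd

# Hypothetical sample data using CRLF line endings.
crlf_data = "id,name\r\n1,a\r\n2,b\r\n"


def try_read_with_crlf_terminator(text):
    """Return the ValueError message raised by read_csv, or None if it succeeds."""
    try:
        # Passing a two-character terminator is what read_csv rejects.
        pd.read_csv(io.StringIO(text), lineterminator="\r\n")
    except ValueError as exc:
        return str(exc)
    return None


error_message = try_read_with_crlf_terminator(crlf_data)
print(error_message)
```

Note that this only exercises read_csv; it says nothing about how to_csv treats the same argument.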

davidcava commented 6 days ago

Hi

OK, I ran some basic tests with Pandas, both 2.2.2 and an older version (no difference noted). The limitation to a 1-character line terminator only exists in read_csv, not in to_csv. In read_csv this is not an issue, because when you do not provide lineterminator, Pandas accommodates whatever it finds (tested with LF, CRLF and even CR only). On the other hand, when you want to write a CSV with a specific line terminator, you need lineterminator. For this, to_csv definitely accepts lineterminator='\r\n', without the bug of adding unexpected double CRs.
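The checks described above can be sketched roughly as follows. This is an illustration, not the exact test that was run: the DataFrame contents and helper name are invented, and it assumes pandas 1.5 or later, where the to_csv keyword is spelled lineterminator (older releases used line_terminator).

```python
import io

import pandas as pd

# Hypothetical sample frame for the round trip.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Write to an in-memory string with an explicit CRLF terminator.
csv_text = df.to_csv(index=False, lineterminator="\r\n")
print(repr(csv_text))

# A doubled carriage return would show up as "\r\r\n".
has_double_cr = "\r\r\n" in csv_text
print("double CR present:", has_double_cr)


def read_without_terminator(raw: bytes) -> pd.DataFrame:
    """Read CSV bytes without passing lineterminator, letting pandas detect it."""
    return pd.read_csv(io.BytesIO(raw))


# The same content with LF, CRLF and CR-only line endings.
variants = {
    "LF": b"id,name\n1,a\n2,b\n",
    "CRLF": b"id,name\r\n1,a\r\n2,b\r\n",
    "CR": b"id,name\r1,a\r2,b\r",
}
expected = {"id": [1, 2], "name": ["a", "b"]}
results = {
    label: read_without_terminator(raw).to_dict("list") == expected
    for label, raw in variants.items()
}
print(results)
```

If plain pandas writes clean CRLF here, the extra carriage return would have to be introduced somewhere after to_csv produces its output, which is where the comparison with the library's own writer would need to start.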