alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License

Add convenient support for stdin #117

Open jbdesbas opened 9 months ago

jbdesbas commented 9 months ago

Hi,

I just discovered this great project, thanks a lot for this amazing work :smiley:

Since CSV processing often happens inside a data pipeline, it would be great to make reading CSV data from stdin more convenient.

Writing to stdout is already easy, because sys.stdout can be passed directly to csv.writer, but reading is a bit trickier.

import io
import sys

import chardet

import clevercsv

# read
input_data = sys.stdin.buffer.read()  # read stdin as binary
detected_encoding = chardet.detect(input_data)['encoding']  # guess the encoding

csvfile = io.StringIO(input_data.decode(detected_encoding))

dialect = clevercsv.Sniffer().sniff(csvfile.read())
csvfile.seek(0)

reader = clevercsv.reader(csvfile, dialect)
rows = list(reader)

# write
writer = clevercsv.writer(sys.stdout, dialect)
writer.writerows(rows)
GjjvdBurg commented 9 months ago

Hi @jbdesbas, thanks for the kind words and for opening this issue! What exactly do you have in mind for the functionality that we can add to CleverCSV to make this easier? A wrapper function perhaps that returns dicts or rows of the CSV file similar to stream_table and stream_dicts (or a modification of these to accept sys.stdin)?

Note that the example you shared is very similar to the standardize command in the CLI. If that command is what you're looking for, issue #107 could capture your request too (please let me know).
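A wrapper along those lines could look like the sketch below. The name stream_rows is hypothetical, not part of the CleverCSV API, and the stdlib csv module stands in for clevercsv here (CleverCSV is a drop-in replacement, so the same pattern applies with clevercsv.Sniffer and clevercsv.reader):

```python
# Hypothetical wrapper sketch, not part of the CleverCSV API.
# It accepts any text stream (e.g. sys.stdin) instead of a filename,
# buffers it so the dialect can be sniffed, and yields parsed rows.
# The stdlib csv module is used as a stand-in for clevercsv.
import csv
import io
import sys


def stream_rows(stream=None):
    """Yield CSV rows from a text stream (defaults to sys.stdin)."""
    stream = stream if stream is not None else sys.stdin
    # Streams like sys.stdin are not seekable, so buffer the whole
    # input once before sniffing the dialect.
    buffered = io.StringIO(stream.read())
    dialect = csv.Sniffer().sniff(buffered.read())
    buffered.seek(0)
    yield from csv.reader(buffered, dialect)


# Example: a StringIO stands in for stdin
rows = list(stream_rows(io.StringIO("a;b\n1;2\n")))
```

Buffering the whole input is a deliberate trade-off: dialect detection needs to look at the data before parsing, which rules out true single-pass streaming from a pipe.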

jbdesbas commented 9 months ago

Hi @GjjvdBurg Yes, I think read/stream table accepting sys.stdin instead of just a filename would be a great improvement. :+1:

My need is slightly different from what the standardize command does: standardize keeps the original encoding for the output file, but I need a UTF-8 file as output (regardless of the original encoding). Additionally, my original script does other things between reading and writing (adding suffixes to deduplicate column names). However, standardize should accept stdin as input too.
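The re-encode-to-UTF-8 step described here is a one-liner once the input encoding is known. A minimal sketch, using a Latin-1 payload for illustration in place of the chardet.detect call from the earlier example:

```python
# Sketch of forcing UTF-8 output regardless of the input encoding.
# In practice the encoding would come from chardet.detect(raw)["encoding"];
# here a known Latin-1 payload is used so the example is self-contained.
import io

raw = "col;caf\u00e9\n1;2\n".encode("latin-1")  # bytes as they might arrive on stdin
detected = "latin-1"                             # stand-in for chardet.detect(raw)["encoding"]

text = raw.decode(detected)                      # decode with the detected encoding
utf8_out = io.BytesIO()
utf8_out.write(text.encode("utf-8"))             # output is UTF-8, whatever came in
```

With real stdout, the same effect comes from writing the decoded text to sys.stdout.buffer after encoding it as UTF-8.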

lisad commented 3 weeks ago

I would have used this too. I'm trying to wrap or adapt part of this library to add the ability to remove completely empty lines, or lines of only commas (frequently added at the end of an Excel table), from the file before turning lines into dicts. It seems less reliable to detect a line of only commas after the line has been parsed into a dict, where I have to check each value. I'm also adding logic to detect duplicate column names BEFORE turning rows into dicts.
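Both pre-processing steps can be done on raw rows before the dict conversion. A sketch, with illustrative helper names that are not part of CleverCSV:

```python
# Sketch of the two pre-processing steps described above, applied to raw
# rows before they become dicts. Helper names here are illustrative only.


def drop_empty_rows(rows):
    """Drop rows that are completely empty or contain only blank cells
    (e.g. the trailing ',,,,' lines Excel appends to a table)."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def dedupe_header(header):
    """Suffix repeated column names so each is unique: a, a -> a, a_2."""
    seen = {}
    out = []
    for name in header:
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return out


rows = [["id", "name", "name"], ["1", "a", "b"], ["", "", ""], ["", " ", ""]]
rows = drop_empty_rows(rows)          # trailing blank rows are gone
header = dedupe_header(rows[0])       # duplicate "name" gets a suffix
records = [dict(zip(header, row)) for row in rows[1:]]
```

Filtering and deduplicating at the row level like this avoids the lossy case where dict(zip(...)) silently collapses duplicate keys.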

Things I've tried or thought of:

Accepting other IO besides a filename in stream_dicts (etc.) would open this up enough for me to fix my problem in an easier way, but so would making get_encoding an exported part of the library, or a number of other things.