alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License

Add convenient support for stdin #117

Open jbdesbas opened 9 months ago

jbdesbas commented 9 months ago

Hi,

I just discovered this great project, thanks a lot for this amazing work :smiley:

Since CSV processing often happens inside a data pipeline, it would be great to make reading CSV data from stdin more convenient.

Writing to stdout is already easy, because sys.stdout can be passed directly to csv.writer, but reading is a bit trickier.

import io
import sys

import chardet

import clevercsv

# read
input_data = sys.stdin.buffer.read()  # read stdin as binary
detected_encoding = chardet.detect(input_data)['encoding']  # guess the encoding

csvfile = io.StringIO(input_data.decode(detected_encoding))

dialect = clevercsv.Sniffer().sniff(csvfile.read())
csvfile.seek(0)

reader = clevercsv.reader(csvfile, dialect)
rows = list(reader)

# write
writer = clevercsv.writer(sys.stdout, dialect)
writer.writerows(rows)
GjjvdBurg commented 9 months ago

Hi @jbdesbas, thanks for the kind words and for opening this issue! What exactly do you have in mind for the functionality that we can add to CleverCSV to make this easier? A wrapper function perhaps that returns dicts or rows of the CSV file similar to stream_table and stream_dicts (or a modification of these to accept sys.stdin)?

Note that the example you shared is very similar to the standardize command in the CLI. If that command is what you're looking for, issue #107 could capture your request too (please let me know).
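A wrapper along those lines could look like the sketch below. The name stream_rows is hypothetical, not part of the CleverCSV API, and the stdlib csv module stands in for clevercsv here (CleverCSV is a drop-in replacement, so the same pattern applies with clevercsv.Sniffer and clevercsv.reader):

```python
# Hypothetical wrapper sketch, not part of the CleverCSV API.
# It accepts any text stream (e.g. sys.stdin) instead of a filename,
# buffers it so the dialect can be sniffed, and yields parsed rows.
# The stdlib csv module is used as a stand-in for clevercsv.
import csv
import io
import sys


def stream_rows(stream=None):
    """Yield CSV rows from a text stream (defaults to sys.stdin)."""
    stream = stream if stream is not None else sys.stdin
    # Streams like sys.stdin are not seekable, so buffer the whole
    # input once before sniffing the dialect.
    buffered = io.StringIO(stream.read())
    dialect = csv.Sniffer().sniff(buffered.read())
    buffered.seek(0)
    yield from csv.reader(buffered, dialect)


# Example: a StringIO stands in for stdin
rows = list(stream_rows(io.StringIO("a;b\n1;2\n")))
```

Buffering the whole input is a deliberate trade-off: dialect detection needs to look at the data before parsing, which rules out true single-pass streaming from a pipe.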

jbdesbas commented 9 months ago

Hi @GjjvdBurg Yes, I think read/stream table accepting sys.stdin instead of just a filename would be a great improvement. :+1:

My need is slightly different from what the standardize command does: standardize keeps the original encoding for the output file, but I need a UTF-8 file as output (regardless of the original encoding). Additionally, my original script does other things between reading and writing (adding suffixes to deduplicate column names). However, standardize should accept stdin as input too.
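The re-encode-to-UTF-8 step described here is a one-liner once the input encoding is known. A minimal sketch, using a Latin-1 payload for illustration in place of the chardet.detect call from the earlier example:

```python
# Sketch of forcing UTF-8 output regardless of the input encoding.
# In practice the encoding would come from chardet.detect(raw)["encoding"];
# here a known Latin-1 payload is used so the example is self-contained.
import io

raw = "col;caf\u00e9\n1;2\n".encode("latin-1")  # bytes as they might arrive on stdin
detected = "latin-1"                             # stand-in for chardet.detect(raw)["encoding"]

text = raw.decode(detected)                      # decode with the detected encoding
utf8_out = io.BytesIO()
utf8_out.write(text.encode("utf-8"))             # output is UTF-8, whatever came in
```

With real stdout, the same effect comes from writing the decoded text to sys.stdout.buffer after encoding it as UTF-8.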

lisad commented 3 weeks ago

I would have used this too. I'm trying to wrap or adapt part of this library to add the ability to remove completely empty lines, or lines of only commas (frequently added at the end of an Excel table), from the file before turning lines into dicts. It seems less reliable to detect a line of only commas after the line has been parsed into a dict, where I have to check each value. I'm also adding logic to detect duplicate column names BEFORE turning rows into dicts.
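Both pre-processing steps can be done on raw rows before the dict conversion. A sketch, with illustrative helper names that are not part of CleverCSV:

```python
# Sketch of the two pre-processing steps described above, applied to raw
# rows before they become dicts. Helper names here are illustrative only.


def drop_empty_rows(rows):
    """Drop rows that are completely empty or contain only blank cells
    (e.g. the trailing ',,,,' lines Excel appends to a table)."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def dedupe_header(header):
    """Suffix repeated column names so each is unique: a, a -> a, a_2."""
    seen = {}
    out = []
    for name in header:
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return out


rows = [["id", "name", "name"], ["1", "a", "b"], ["", "", ""], ["", " ", ""]]
rows = drop_empty_rows(rows)          # trailing blank rows are gone
header = dedupe_header(rows[0])       # duplicate "name" gets a suffix
records = [dict(zip(header, row)) for row in rows[1:]]
```

Filtering and deduplicating at the row level like this avoids the lossy case where dict(zip(...)) silently collapses duplicate keys.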

Things I've tried or thought of:

Accepting other IO besides a filename in stream_dicts (etc.) would open this up enough for me to fix my problem in an easier way, but so would making get_encoding an exported part of the library, or a number of other things.