Exclude header option - Githubissues

hmaddocks commented 3 years ago

TLDR; A command option to exclude the header line for CSV files that use headers incorrectly.

My apologies if this is a duplicate. I did search the issues but couldn't find anything.

I work in an industry where CSV is the primary means of transferring data between industry participants. The irony is that ALL the CSV files are invalid: the first line of each file doesn't describe the columns, it contains metadata. This causes almost all xsv commands to fail with "found record with 64 fields, but the previous record has 7 fields" errors.

My proposal is to add a skip-header option. An enhancement could be to add dummy header labels, possibly using the common spreadsheet method of alphabetic labels (A, B, C, ...).

I know the fixlengths command offers a workaround, but it results in some not so pleasant output for other commands. I'm prepared to have a crack at this myself, but first: is there an appetite for adding this functionality?
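To make the dummy-label idea concrete, here is a minimal Python sketch of spreadsheet-style column labels. The function names `column_label` and `dummy_header` are hypothetical and not part of xsv; this only illustrates the labelling scheme, not how xsv would implement it.

```python
def column_label(index):
    """Return the spreadsheet-style label for a zero-based column index."""
    label = ""
    n = index + 1  # the letter scheme is easiest to compute on one-based numbers
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def dummy_header(width):
    """Build a placeholder header row for a file with `width` columns."""
    return [column_label(i) for i in range(width)]


if __name__ == "__main__":
    print(dummy_header(3))
    # Labels for the 26th, 27th and 64th columns.
    print(column_label(25), column_label(26), column_label(63))
```

For the 64-field records in the error message above, the last column would be labelled BL under this scheme.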

BurntSushi commented 3 years ago

Could you please provide some sample input along with some commands and how you would hope they work?

For example, your description isn't totally clear to me. It sounds like you just want to skip the first line and not the header.

hmaddocks commented 3 years ago

> It sounds like you just want to skip the first line

Yes, that is what I want. Here's a sample.

HDR,RSICPLIST,RGST,ALPE,22/09/2015,15:54:12,00034524
DET,0000000000AL6A2,01/05/2006,01/05/2006,01/11/2014,22/09/2015,NET-6600778,ALPE,ABY0111,GN,N,B,CTCT,,,9000,fresh water,,,,,PRI-9777821,IND,AOP,0.00,NA,,REC-20935887,PUNZ,HHR,D261,ALPE,Y,N,N,,,,MET-13334527,ALPE,5,Y,N,N,N,1,N,,STA-4313832,002,0,CT14033680,ADD-5155513,,Erapid 300,Timaru & Oamaru,Opuha Dam Road,Sherwood Downs,Fairlie,0,Opuha Embedded Generator,1431017.04,5125897.96,
DET,0000000001ALAE7,01/04/1999,01/04/1999,01/04/2015,22/09/2015,NET-6601766,ALPE,TIM0111,GN,N,L,,Streetlighting,,0,,04/10/2002,,,,PRI-10129867,ASSLCA,ALV,572.00,NA,,REC-21674685,CTCT,HHR,C243,CTCT,Y,N,Y,14406,0,,MET-13821561,CTCT,1,Y,N,N,N,1,N,,1999D335761-S,002,0,UPE10.00159,ADD-5129948,,,Timaru & Oamaru,All TIM0111 Streets,,Timaru,0,Streetlighting,1459083.85,5085121,

xsv count --skip-header LIS20150922155412.txt
xsv count --skip-first-line LIS20150922155412.txt

I can almost achieve what I want with tail or sed, e.g.

tail -n +2 LIS20150922155412.txt | xsv count

But the count is off by one because xsv treats the first remaining line as the header and doesn't count it.
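The behaviour being asked for can be sketched in a few lines of Python: drop the metadata line, then count every remaining row as data, with no row treated as a header. This is only an illustration of the desired result, not xsv code; `count_records` and the shortened `SAMPLE` text are made up for the example.

```python
import csv
import io

# Shortened stand-in for the real file: one metadata line, two data rows.
SAMPLE = (
    "HDR,RSICPLIST,RGST,ALPE,22/09/2015,15:54:12,00034524\n"
    "DET,0000000000AL6A2,01/05/2006,ALPE\n"
    "DET,0000000001ALAE7,01/04/1999,ALPE\n"
)


def count_records(stream, skip_lines=1):
    """Skip `skip_lines` raw lines, then count all remaining CSV rows."""
    for _ in range(skip_lines):
        stream.readline()
    # No header handling here: every remaining row counts as a record.
    return sum(1 for _ in csv.reader(stream))


if __name__ == "__main__":
    print(count_records(io.StringIO(SAMPLE)))
```

xsv commands also document a `-n`/`--no-headers` flag, which may be worth trying after `tail`; that combination is not verified here.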

Actually, the more I think about this, the bigger the can of worms becomes. One reason xsv appealed to me is splitting files, but in that case I would want to ignore the first line when opening the file and still include it in the split files.

xsv split split_dir LIS20150922155412.txt
CSV error: record 1 (line: 1, byte: 53): found record with 64 fields, but the previous record has 7 fields

Edit: formatting

CGUTA commented 3 years ago

Hello:

I ran into a similar problem earlier and I also feel like xsv could benefit from a command to skip some lines.

Justification: In my case the CSVs had extra metadata headers because some other code that ingests them needed to know the dimensions beforehand, and these are specified in the headers. I have also seen this practice in other specialized files like the MatrixMarket format. These metadata lines are sometimes marked as comments, but the comment marker depends on the use case.

Illustration:

1

# 3 3
,a,b
a,1,0
b,0,1

2

% MyCSV v1.4
% other metadata
% foo
,a,b
a,1,0
b,0,1

Proposal: Since it's a formatting issue and xsv is designed to be pipe friendly, the best option may be to add it to xsv fmt: xsv fmt --skip 2
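As a rough sketch of what such a skip step would do, the Python below drops a fixed number of leading lines, or alternatively any leading lines that start with a comment marker, before handing the rest to a CSV parser. The helper `strip_preamble` and its parameters are hypothetical; nothing here reflects an existing xsv option.

```python
import csv
import io
import itertools

ILLUSTRATION_1 = "# 3 3\n,a,b\na,1,0\nb,0,1\n"
ILLUSTRATION_2 = "% MyCSV v1.4\n% other metadata\n% foo\n,a,b\na,1,0\nb,0,1\n"


def strip_preamble(lines, skip=0, comment_prefixes=()):
    """Drop `skip` leading lines, then any leading lines with a comment prefix."""
    lines = itertools.islice(lines, skip, None)
    if comment_prefixes:
        prefixes = tuple(comment_prefixes)
        # Only leading comment lines are dropped; later ones are left alone.
        lines = itertools.dropwhile(lambda line: line.startswith(prefixes), lines)
    return lines


if __name__ == "__main__":
    by_count = list(csv.reader(strip_preamble(io.StringIO(ILLUSTRATION_1), skip=1)))
    by_marker = list(
        csv.reader(strip_preamble(io.StringIO(ILLUSTRATION_2), comment_prefixes=("%",)))
    )
    print(by_count)
    print(by_marker)
```

Skipping by count matches the proposed `--skip N`; skipping by marker is an alternative that avoids having to know N in advance, at the cost of having to know the marker.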

CGUTA commented 3 years ago

For the case of split: outputting the metadata lines with every chunk would not be universally a good idea, since the metadata is sometimes tied to the size and contents of the table before splitting. But CSV metadata is not always tied to the file's contents; sometimes it records the version of a specialized CSV format or other information that people have found useful to store inside the file.

As things stand, if one wanted to use xsv split to split the files and fix the headers afterwards, it would be hard because xsv controls the chunk writing. Pre-writing the header to pre-created chunk files might be an option, but I believe xsv split would then overwrite the files. Adding an append flag to xsv split does not seem logical.

Since xsv split is already a sink command (it is not pipeable), possible solutions are:

- Silently re-add the lines ignored by a putative --skip. (Not ideal, since the code would have to track both the skipped lines and the header.)
- Add a parameter --header-size that states the size of the metadata header that must be prepended. (I also think this is not the cleanest; same problem as above.)

My preferred solution: add a parameter --header-file to xsv split that points to a file containing the header lines to be prepended to the chunks. This gives the user a bit of control over the chunk creation.

Your workflow would be (using illustration 2):

SKIP=3
head -n $SKIP big.csv > metadata.tmp
xsv fmt --skip "$SKIP" big.csv | xsv foo | xsv foo | xsv split -s 1 --header-file metadata.tmp split_dir
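The intended effect of that workflow can be sketched in Python as follows: keep the metadata lines aside, split the remaining table into chunks, and write the metadata plus the header row at the top of each chunk. The function `split_with_preamble` is hypothetical and works on in-memory text for brevity. It assumes, as in illustration 2, that a real header row follows the metadata lines.

```python
import csv
import io
import itertools

SAMPLE = "% MyCSV v1.4\n% other metadata\n% foo\n,a,b\na,1,0\nb,0,1\n"


def split_with_preamble(stream, skip, chunk_size):
    """Yield chunk texts, each starting with the preamble and the header row."""
    preamble = [stream.readline() for _ in range(skip)]
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        return  # nothing after the preamble, so there are no chunks
    while True:
        rows = list(itertools.islice(reader, chunk_size))
        if not rows:
            break
        out = io.StringIO()
        out.writelines(preamble)  # metadata lines are copied verbatim
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
        yield out.getvalue()


if __name__ == "__main__":
    for number, chunk in enumerate(split_with_preamble(io.StringIO(SAMPLE), 3, 1)):
        print(f"--- chunk {number} ---")
        print(chunk, end="")
```

Copying the metadata verbatim is only safe when it does not describe the table's size, which is the caveat raised earlier about dimension headers such as `# 3 3`.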

BurntSushi / xsv

Exclude header option #269