Open hmaddocks opened 3 years ago
Could you please provide some sample input along with some commands and how you would hope they work?
For example, your description isn't totally clear to me. It sounds like you just want to skip the first line and not the header.
> It sounds like you just want to skip the first line
Yes, that is what I want. Here's a sample.
HDR,RSICPLIST,RGST,ALPE,22/09/2015,15:54:12,00034524
DET,0000000000AL6A2,01/05/2006,01/05/2006,01/11/2014,22/09/2015,NET-6600778,ALPE,ABY0111,GN,N,B,CTCT,,,9000,fresh water,,,,,PRI-9777821,IND,AOP,0.00,NA,,REC-20935887,PUNZ,HHR,D261,ALPE,Y,N,N,,,,MET-13334527,ALPE,5,Y,N,N,N,1,N,,STA-4313832,002,0,CT14033680,ADD-5155513,,Erapid 300,Timaru & Oamaru,Opuha Dam Road,Sherwood Downs,Fairlie,0,Opuha Embedded Generator,1431017.04,5125897.96,
DET,0000000001ALAE7,01/04/1999,01/04/1999,01/04/2015,22/09/2015,NET-6601766,ALPE,TIM0111,GN,N,L,,Streetlighting,,0,,04/10/2002,,,,PRI-10129867,ASSLCA,ALV,572.00,NA,,REC-21674685,CTCT,HHR,C243,CTCT,Y,N,Y,14406,0,,MET-13821561,CTCT,1,Y,N,N,N,1,N,,1999D335761-S,002,0,UPE10.00159,ADD-5129948,,,Timaru & Oamaru,All TIM0111 Streets,,Timaru,0,Streetlighting,1459083.85,5085121,
xsv count --skip-header LIS20150922155412.txt
xsv count --skip-first-line LIS20150922155412.txt
I can almost achieve what I want with tail or sed, e.g.:
tail -n +2 LIS20150922155412.txt | xsv count
But the count is off by one, because xsv treats the first remaining line as the header and doesn't count it.
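If the file has no real header row at all, xsv's global -n/--no-headers flag may fix the off-by-one, since it makes every remaining line count as a record. A sketch, using a minimal stand-in for the sample file:

```shell
# Hypothetical sample: line 1 is metadata, lines 2-4 are data records.
printf 'HDR,meta\na,1\nb,2\nc,3\n' > sample.txt

# Drop the metadata line, then count every remaining line as a record.
# -n / --no-headers stops xsv from treating the first line as a header.
tail -n +2 sample.txt | xsv count -n
# → 3
```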
Actually the more I think about this the bigger can of worms it becomes. One reason xsv appealed is for splitting files, but in that case I would want to ignore the first line when opening the file but include it in the split files.
xsv split split_dir LIS20150922155412.txt
CSV error: record 1 (line: 1, byte: 53): found record with 64 fields, but the previous record has 7 fields
Edit: formatting
Hello:
I ran into a similar problem earlier and I also feel like xsv could benefit from a command to skip some lines.
Justification: In my case the CSVs had extra metadata headers because some other code that ingests them needs to know the dimensions beforehand, and these are specified in the header lines. I have also seen this practice in other specialized formats such as MatrixMarket. These metadata lines are sometimes marked as comments, but the comment symbol depends on the use case.
Illustration 1:
# 3 3
,a,b
a,1,0
b,0,1
Illustration 2:
% MyCSV v1.4
% other metadata
% foo
,a,b
a,1,0
b,0,1
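For files like the second illustration, where every metadata line shares a distinctive comment prefix, one workaround available today is to strip those lines before piping into xsv. A sketch, assuming '%' marks all (and only) the metadata lines:

```shell
# Build the file from illustration 2 above.
printf '%% MyCSV v1.4\n%% other metadata\n%% foo\n,a,b\na,1,0\nb,0,1\n' > matrix.csv

# Delete the comment-prefixed lines so xsv sees a clean header row.
sed '/^%/d' matrix.csv | xsv count
# → 2
```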
Proposal:
Since it's a formatting issue and xsv is designed to be pipe-friendly, the best option might be to add it to xsv format:
xsv format --skip 2
For the case of split: always copying the metadata lines into each output chunk would not be universally correct, since the metadata sometimes describes the size and contents of the table before splitting. But CSV metadata is not always content-related; sometimes it records the version of a specialized CSV format, or other information people have found useful to store inside the file.
At the current state, if one wanted to use xsv split to split the files and fix the headers afterwards, it would be hard, because xsv controls the chunk writing. Pre-writing the header into pre-created chunk files might be an option, but I believe xsv split would then overwrite the files. Adding an append flag to xsv split does not seem logical either. Since xsv split is already a sink (it is not pipeable), possible solutions are:
- Silently prepend the skipped lines via the putative --skip (not ideal, since the code would have to track both the skipped lines and the header).
- Add a --header-size parameter stating the number of metadata header lines that must be prepended (also not the cleanest; same problem as above).
My preferred solution: add a --header-file parameter to xsv split that points to a file containing the header lines to be prepended to each chunk. This gives the user some control over how the chunks are created.
Your workflow would be (using illustration 2):
SKIP=3
head -n $SKIP big.csv > metadata.tmp
xsv fmt --skip $SKIP big.csv | xsv foo | xsv foo | xsv split -s 1 --header-file metadata.tmp outdir
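Until such a flag exists, the prepending can also be done outside xsv after the split. A sketch, assuming big.csv from illustration 2, a chunks/ output directory, and xsv split's default per-chunk .csv file names:

```shell
SKIP=3
head -n "$SKIP" big.csv > metadata.tmp                     # save the metadata header
tail -n +"$((SKIP + 1))" big.csv | xsv split -s 1 chunks/  # split the clean CSV
for f in chunks/*.csv; do                                  # glue metadata onto each chunk
  cat metadata.tmp "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```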
TL;DR: a command option to exclude the first line for CSV files that use the header row incorrectly.
My apologies if this is a duplicate; I did search the issues but couldn't find anything. I work in an industry where CSV is the primary means of transferring data between participants. The irony is that ALL of these CSV files are invalid: the first line of each file doesn't describe the columns, it contains metadata. This causes almost all xsv commands to fail with "found record with 64 fields, but the previous record has 7 fields" errors.

My proposal is to add a skip-header option. An enhancement could be to also add dummy header labels, possibly using the common spreadsheet convention of alphabetic labels. I know the fixlengths command offers a workaround, but it results in some not-so-pleasant output for other commands.

I'm prepared to have a crack at this myself, but first: is there an appetite for adding this functionality?
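The dummy-label idea can be prototyped outside xsv today. A sketch, assuming a hypothetical input data.csv (metadata line first, then records) with at most 26 columns, that replaces the metadata line with spreadsheet-style A,B,C labels generated from the first real record:

```shell
# Hypothetical input: one metadata line, then 3-field data records.
printf 'HDR,meta\nx,1,0\ny,0,1\n' > data.csv

{
  # Generate A,B,C,... from the field count of the first data record.
  head -n 2 data.csv | tail -n 1 |
    awk -F, '{ for (i = 1; i <= NF; i++) printf "%s%c", (i > 1 ? "," : ""), 64 + i; print "" }'
  tail -n +2 data.csv   # then the data itself
} > fixed.csv

head -n 1 fixed.csv
# → A,B,C
```

fixed.csv now has a valid header row, so any xsv command should accept it unmodified.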