BurntSushi / rust-csv

A CSV parser for Rust, with Serde support.
The Unlicense

Skipping empty rows #159

Open binh-vu opened 5 years ago

binh-vu commented 5 years ago

What version of the csv crate are you using?

1.1.1

Briefly describe the question, bug or feature request.

The library skips empty rows even when flexible is set to true.

The default behavior of Python's csv library, Excel, and Numbers is to keep empty rows, so I think this library should follow that (users may have a hard time finding the difference between the data they see in Excel and the result from the Rust code).

Include a complete program demonstrating a problem.

use csv; // 1.1.1

fn main() {
    let s = "year,make\n\n1948,Porsche\n1967,Ford";

    let reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .from_reader(s.as_bytes());

    for (i, record) in reader.into_records().enumerate() {
        println!("row: {} - record: {:?}", i, record.unwrap());
    }
}

What is the observed behavior of the code above?

row: 0 - record: StringRecord(["year", "make"])
row: 1 - record: StringRecord(["1948", "Porsche"])
row: 2 - record: StringRecord(["1967", "Ford"])

What is the expected or desired behavior of the code above?

row: 0 - record: StringRecord(["year", "make"])
row: 1 - record: StringRecord([])
row: 2 - record: StringRecord(["1948", "Porsche"])
row: 3 - record: StringRecord(["1967", "Ford"])

Here is the Python code for the same CSV:

import io, csv
s = "year,make\n\n1948,Porsche\n1967,Ford"
print(list(csv.reader(io.StringIO(s))))
# output >>> [['year', 'make'], [], ['1948', 'Porsche'], ['1967', 'Ford']]
jaredforth commented 3 years ago

Has this issue still not been addressed?

BurntSushi commented 3 years ago

@jaredforth Correct.

Re-reading this issue now, I am just not convinced that this change should be made. At the very least, this would be a breaking change, and I'm not convinced of the utility of including empty rows. The only reason given is for consistency with Python's csv reader, but I don't think that's a good enough reason on its own.

JamesMc86 commented 1 year ago

To add another voice to this - this has also just caught me out. I'm running in flexible mode and an empty row has meaning to the structure of the file I'm trying to parse.

I have a workaround in this case, but wanted to add another voice in support of this feature.

BurntSushi commented 1 year ago

Folks could help their case here by providing a more detailed motivation for why they need this feature. Give some examples. Explain why there is no other way or easy workaround.

gblock0 commented 1 year ago

We are working on a POC with this, so it's not blocking anything at the moment, but I figured I'd add our use case.

We use JSON Schema to define TSV/CSV files, and we would like to use this library to validate that a TSV/CSV file adheres to the schema. As a concrete example, we allow some fields to be required and some to be nullable; a nullable field can be an empty string, "N/A", or really anything we want. We'd like to parse the whole file and know whether there is an empty line, because if required columns are missing or the nullable column definition is not adhered to, validation should fail.

Would it be possible to add this as an optional param like "skip_empty_lines" that defaults to true, so it wouldn't be a breaking change?

BurntSushi commented 1 year ago

@gblock0 I see. So it sounds like the idea here is that you want to do some kind of verification step that possibly precludes the existence of empty lines in a CSV file, but since this parser completely skips them and makes them inaccessible to you, it's not possible to (easily) check whether they exist.

Would it be possible to add this as an optional param like "skip_empty_lines" and defaults to true so it wouldn't be a breaking change?

Yes, if I were to add this, then it would have to come in under an option because I won't release a csv 2.0 just for this.

FWIW, I don't see this happening any time soon. IIRC, this behavior is baked in all the way down in csv-core's finite state machine. I am not convinced that this is really something worth supporting, but if someone did come up with a simple patch for this (and is an opt-in feature), then I think I'd be willing to bring it in.
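For the validation use case described above, one possible stopgap while the parser still skips blank lines (a sketch of my own, not something proposed in this thread) is to pre-scan the raw text for blank lines before handing it to the csv reader. Note that this simple line-based scan miscounts if a quoted field contains a literal newline:

```rust
// Sketch: report 1-based line numbers of blank lines in raw CSV text,
// so a validator can flag them even though the csv parser skips them.
// Caveat: this treats every raw newline as a row boundary, which is
// wrong for quoted fields that contain embedded newlines.
fn blank_line_numbers(data: &str) -> Vec<usize> {
    data.lines()
        .enumerate()
        .filter(|(_, line)| line.trim().is_empty())
        .map(|(i, _)| i + 1)
        .collect()
}

fn main() {
    let s = "year,make\n\n1948,Porsche\n1967,Ford";
    // Line 2 is blank, so this prints [2].
    println!("{:?}", blank_line_numbers(s));
}
```

A validator could fail the file whenever this returns a non-empty list, independently of what the csv reader then yields.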

gblock0 commented 1 year ago

@BurntSushi Yep, exactly. I totally understand that it's not a priority, and if I were more familiar with Rust I'd take a swing at it.

Thanks for hearing me out and all the work you do!

jccalvojackson commented 1 year ago

Tried to do the first problem of Advent of Code, but since it skips the empty lines, csv is not useful here 🤷‍♂️

BurntSushi commented 1 year ago

I only skimmed the problem, but I don't see any CSV data there. It's just line delimited. There's only a single column.

jccalvojackson commented 1 year ago

yeah, it is not csv in the strict sense, just a file. I'm using csv as a generic file reader whenever the file has a column structure (this being the simplest with just one column). So granted, not a use case that really justifies this change.

Yarn commented 1 year ago

To add another specific use case, I'm using xsv fixlengths and xsv table to view a rather interesting csv, essentially multiple csvs and some other random info separated by blank lines. Losing the blank lines makes it substantially harder to distinguish between different sections quickly.

amansawhney commented 1 year ago

I have a use case for this feature. I am reading in a CSV with one column that is deserialized using Serde. The column is an Option<usize>. If the empty rows are skipped, all the None values are dropped and the data is parsed incorrectly.

amansawhney commented 1 year ago

I have a use case for this feature. I am reading in a CSV with one column that is deserialized using Serde. The column is an Option<usize>. If the empty rows are skipped, all the None values are dropped and the data is parsed incorrectly.

I was actually able to fix this by wrapping every value in my CSV in quotes when serializing the data earlier in the pipeline. I still think this feature should be added, since not everyone has the flexibility to change their data pipeline.

dgendill commented 3 months ago

This feature would be very useful to me as well. Same as Yarn's comment, I'm working with a CSV that basically has multiple tables embedded inside of it. Here is a very minimal example of a real-world CSV that some financial companies export.

Account Number,Investment Name,Symbol,Shares,Share Price,Total Value,
1234567,NVIDIA Corp,NVDA,1,130.7,130.7,

Account Number,Trade Date,Settlement Date,Transaction Type,Transaction Description,Investment Name,Symbol,Shares,Share Price,Principal Amount,Commissions and Fees,Net Amount,Accrued Interest,Account Type,
1234567,2023-03-27,2023-03-27,Buy,Buy,Nvidia Corp,NVDA,0.00000,1.0,0.30,0.0,0.30,0.0,CASH,

Being able to see the trailing empty lines would signal that I'm approaching a different table of data.
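As a hypothetical sketch (my own, not from this thread) of the sectioning this use case needs, the same effect can be approximated today by pre-splitting the raw text into sections on blank lines and then parsing each section with its own csv reader. The same caveat applies: a raw line split is wrong if a quoted field contains an embedded newline:

```rust
// Sketch: split raw text into table sections separated by blank lines,
// so each section can be fed to its own csv reader afterwards.
fn split_sections(data: &str) -> Vec<Vec<&str>> {
    let mut sections: Vec<Vec<&str>> = vec![Vec::new()];
    for line in data.lines() {
        if line.trim().is_empty() {
            // Start a new section, but don't create empty ones for
            // consecutive blank lines.
            if !sections.last().unwrap().is_empty() {
                sections.push(Vec::new());
            }
        } else {
            sections.last_mut().unwrap().push(line);
        }
    }
    // Drop a trailing empty section left by blank lines at EOF.
    if sections.last().unwrap().is_empty() {
        sections.pop();
    }
    sections
}

fn main() {
    let data = "Account Number,Investment Name\n1234567,NVIDIA Corp\n\n\
                Account Number,Trade Date\n1234567,2023-03-27\n";
    // Two tables separated by a blank line, so this prints 2.
    println!("{}", split_sections(data).len());
}
```

Having skip_blank_lines(false) in the reader itself would make this unnecessary and would handle quoted newlines correctly.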

I was able to test Yarn's PR and it seems to work the way I expect. It looks like a non-breaking change that only adds a new option to the builder. I tested by pointing my Cargo.toml at Yarn's PR branch...

[dependencies]
csv = { git = "https://github.com/Yarn/rust-csv.git" }

And then calling skip_blank_lines(false) on the builder.

use std::{error::Error, process};

fn example() -> Result<(), Box<dyn Error>> {
    let file = std::fs::File::open("documents/example2.csv")?;
    let reader = std::io::BufReader::new(file);

    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .skip_blank_lines(false)
        .trim(csv::Trim::None)
        .from_reader(reader);

    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

fn main() {
    if let Err(err) = example() {
        println!("error running example: {}", err);
        process::exit(1);
    }
}