dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.03k stars 1.89k forks source link

[DataFrame] Allow quoting when loading CSV #5674

Closed lqdev closed 2 years ago

lqdev commented 4 years ago

Given the following code:

let df = DataFrame.LoadCsv("Data/test.csv")

And data that looks like the snippet below:

Project Name,Neighborhood,Zip Code,TtlProjUnits,RentUnits,OwnUnits,TtlMarket,MarketRent,MarketOwn,Total Income- Restricted,Income- Restricted Rental,Income- Restricted Ownership,Tenure,Public/ Private,Includes Elderly Units?,Section_8
AB&W Cooperative Building,Roxbury,02121,24,24,0,0,0,0,24,24,0,Rental,Private,,

If a row contains the separator character, although surrounded in quotes:

"Berkeley Investments, Inc.",South Boston,02127,97,0,97,89,0,89,8,0,8,Ownership,Private,,

The following error is thrown:

Unhandled exception. System.ArgumentException: Expected value to be of type System.Single (Parameter 'val')
   at Microsoft.Data.Analysis.DataFrame.AppendRow(List`1 columns, Int64 rowIndex, String[] values)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(Stream csvStream, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn)
   at Program.main(String[] argv) in C:\Users\lqdev\Development\dotnet\FsDataFrameSample\FsDataFrameSample\Program.fs:line 9   

This is the source dataset for reference: https://data.boston.gov/dataset/income-restricted-housing

pgovind commented 4 years ago

This is a consequence of how LoadCsv is implemented. At the moment, we're blindly splitting on "," on each line we read from a file. The ideal solution is to parse each line and build the column values as we go. That's essentially what CsvReader(an external lib) does. For a quick-fix, I can run a RegEx on each line being read in and split the values that way. It would be less efficient though.

For a work-around to get unblocked right now, a RegEx that does the following can be run on the file: "(any character that is not ,),(any character that is not ,)" should be replaced by "(any character that is not ,)(any character that is not ,)". I do recognize that it's not a satisfactory solution though

pgovind commented 4 years ago

So 1 idea I had here is to use a parser generator for csv. If I specified a grammar for csv, one that allowed quoting inside fields, I should be able to use the generated parser to parse the csv values. The main draw for me here is that we get the parser for free and need to only maintain the grammar. I do want to check the perf though.

NeilMacMullen commented 4 years ago

Hmm - CsvHelper works perfectly well as far as I can tell and deals with all the vagaries of RFC4180 compliance. I understand why you'd want to avoid taking a dependency on an external package for a core library but in the short/medium term why not provide Csv interop as a separate 'extension' package that does use CsvHelper to do the heavy lifting? I have plenty of real data I want to play with - all stored in CSV files and the inability to correctly parse them here is a bit of a pain! :-)

praveenraghuvanshi1512 commented 4 years ago

I am also stuck due this issue and can't proceed. While working on COVID-19 dataset, I found LoadCsv method doesn't handle column values containing separator.

Dataset Source: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/03-28-2020.csv For E.g, Consider the first row from the dataset.

53031,Jefferson,Washington,US,2020-03-27 22:14:55,47.75003077,-123.56097040000002,11,0,0,0,"Jefferson, Washington, US"

The value in the last column had a comma in the value and if we split above line by comma, it'll give us 14 values instead of 12. This mismatch in LoadCsv throws an exception.

There is already a discussion over SO on the same and I tried with a solution of CsvParser(..) and it seems to be working. https://stackoverflow.com/questions/6542996/how-to-split-csv-whose-columns-may-contain

Can we have a workaround similar to CsvParser(..) till we get a proper fix?

NeilMacMullen commented 4 years ago

@praveenraghuvanshi1512 - you may wish to use Deedle dataframes in the interim .... it's probably not as performant but seems to work well and has no problems with parsing CSV files as far as I can see... http://bluemountaincapital.github.io/Deedle/csharpintro.html An alternative is to use the CsvHelper library (https://github.com/JoshClose/CsvHelper) to load in the csv and then create the dataframe from the objects.

zyzhu commented 4 years ago

Well, I would not use Deedle's csv loading. It uses exactly the same csv parser and inference in FSharp.Data.

It works OK on small csv files. The performance is bad when loading big csv file like more than 1gb. Sometimes it runs out of memory. It has also drawbacks such as not recognizing scientific numbers such as 1e6. Just search csv in Deedle's issue and you will find a lot of problems. I've managed to fixed a few trivial ones but the performance is a main issue.

NeilMacMullen commented 4 years ago

Thanks @zyzhu - I could have sworn Deedle handled embedded commas correctly but haven't used it for a while so will defer to your experience! Maybe the best workaround then is to write a small utility using CsvHelper to load in the original rows and then remove embedded commas an a cell by cell basis before writing them back out again.

NeilMacMullen commented 4 years ago

@praveenraghuvanshi1512 You can use this code with CsvHelper to do a quick file conversion..

private static void Main(string[] args)
        {
            //usage prog.exe in.csv out.csv
            var (inFile, outFile) = (args[0], args[1]);

            var culture = CultureInfo.InvariantCulture;
            using var reader = new StreamReader(inFile);
            using var csvIn = new CsvReader(reader, new CsvConfiguration(culture));
            using var recordsIn = new CsvDataReader(csvIn);
            using var writer = new StreamWriter(outFile);
            using var outCsv = new CsvWriter(writer, culture);

            //change this strategy if you prefer to just delete the commas
            static string Sanitise(string s) => s.Replace(",", "_");

            while (recordsIn.Read())
            {
                var columns = recordsIn.FieldCount;
                for (var i = 0; i < columns; i++)
                    outCsv.WriteField(Sanitise(recordsIn.GetString(i)));
                outCsv.NextRecord();
            }
        }
pgovind commented 4 years ago

If I may just chime in, since there is enough interest here in getting a more featured CSV reader, it might be worth it to create a new csproj in corefxlab called DataFrame_IO and write a LoadCsv method taking a dependency on CsvHelper. Here is a catch-all issue I opened to address this: https://github.com/dotnet/corefxlab/issues/2785.

praveenraghuvanshi1512 commented 4 years ago

@praveenraghuvanshi1512 You can use this code with CsvHelper to do a quick file conversion..

private static void Main(string[] args)
        {
            //usage prog.exe in.csv out.csv
            var (inFile, outFile) = (args[0], args[1]);

            var culture = CultureInfo.InvariantCulture;
            using var reader = new StreamReader(inFile);
            using var csvIn = new CsvReader(reader, new CsvConfiguration(culture));
            using var recordsIn = new CsvDataReader(csvIn);
            using var writer = new StreamWriter(outFile);
            using var outCsv = new CsvWriter(writer, culture);

            //change this strategy if you prefer to just delete the commas
            static string Sanitise(string s) => s.Replace(",", "_");

            while (recordsIn.Read())
            {
                var columns = recordsIn.FieldCount;
                for (var i = 0; i < columns; i++)
                    outCsv.WriteField(Sanitise(recordsIn.GetString(i)));
                outCsv.NextRecord();
            }
        }

Thank you for reference, will try it.

luisquintanilla commented 2 years ago

Unable to repro in version 0.20.0-preview.22313.1. Closing for now.