Closed lqdev closed 2 years ago
This is a consequence of how LoadCsv
is implemented. At the moment, we're blindly splitting on "," on each line we read from a file. The ideal solution is to parse each line and build the column values as we go. That's essentially what CsvReader
(an external lib) does. For a quick-fix, I can run a RegEx on each line being read in and split the values that way. It would be less efficient though.
For a work-around to get unblocked right now, a RegEx that does the following can be run on the file: "(any character that is not ,),(any character that is not ,)" should be replaced by "(any character that is not ,)(any character that is not ,)". I do recognize that it's not a satisfactory solution though
So 1 idea I had here is to use a parser generator for csv. If I specified a grammar for csv, one that allowed quoting inside fields, I should be able to use the generated parser to parse the csv values. The main draw for me here is that we get the parser for free and need to only maintain the grammar. I do want to check the perf though.
Hmm - CsvHelper works perfectly well as far as I can tell and deals with all the vagaries of RFC4180 compliance. I understand why you'd want to avoid taking a dependency on an external package for a core library but in the short/medium term why not provide Csv interop as a separate 'extension' package that does use CsvHelper to do the heavy lifting? I have plenty of real data I want to play with - all stored in CSV files and the inability to correctly parse them here is a bit of a pain! :-)
I am also stuck due this issue and can't proceed. While working on COVID-19 dataset, I found LoadCsv method doesn't handle column values containing separator.
Dataset Source: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/03-28-2020.csv For E.g, Consider the first row from the dataset.
53031,Jefferson,Washington,US,2020-03-27 22:14:55,47.75003077,-123.56097040000002,11,0,0,0,"Jefferson, Washington, US"
The value in the last column had a comma in the value and if we split above line by comma, it'll give us 14 values instead of 12. This mismatch in LoadCsv throws an exception.
There is already a discussion over SO on the same and I tried with a solution of CsvParser(..) and it seems to be working. https://stackoverflow.com/questions/6542996/how-to-split-csv-whose-columns-may-contain
Can we have a workaround similar to CsvParser(..) till we get a proper fix?
@praveenraghuvanshi1512 - you may wish to use Deedle dataframes in the interim .... it's probably not as performant but seems to work well and has no problems with parsing CSV files as far as I can see... http://bluemountaincapital.github.io/Deedle/csharpintro.html An alternative is to use the CsvHelper library (https://github.com/JoshClose/CsvHelper) to load in the csv and then create the dataframe from the objects.
Well, I would not use Deedle's csv loading. It uses exactly the same csv parser and inference in FSharp.Data.
It works OK on small csv files. The performance is bad when loading big csv file like more than 1gb. Sometimes it runs out of memory. It has also drawbacks such as not recognizing scientific numbers such as 1e6. Just search csv in Deedle's issue and you will find a lot of problems. I've managed to fixed a few trivial ones but the performance is a main issue.
Thanks @zyzhu - I could have sworn Deedle handled embedded commas correctly but haven't used it for a while so will defer to your experience! Maybe the best workaround then is to write a small utility using CsvHelper to load in the original rows and then remove embedded commas an a cell by cell basis before writing them back out again.
@praveenraghuvanshi1512 You can use this code with CsvHelper to do a quick file conversion..
private static void Main(string[] args)
{
//usage prog.exe in.csv out.csv
var (inFile, outFile) = (args[0], args[1]);
var culture = CultureInfo.InvariantCulture;
using var reader = new StreamReader(inFile);
using var csvIn = new CsvReader(reader, new CsvConfiguration(culture));
using var recordsIn = new CsvDataReader(csvIn);
using var writer = new StreamWriter(outFile);
using var outCsv = new CsvWriter(writer, culture);
//change this strategy if you prefer to just delete the commas
static string Sanitise(string s) => s.Replace(",", "_");
while (recordsIn.Read())
{
var columns = recordsIn.FieldCount;
for (var i = 0; i < columns; i++)
outCsv.WriteField(Sanitise(recordsIn.GetString(i)));
outCsv.NextRecord();
}
}
If I may just chime in, since there is enough interest here in getting a more featured CSV reader, it might be worth it to create a new csproj in corefxlab called DataFrame_IO
and write a LoadCsv
method taking a dependency on CsvHelper
. Here is a catch-all issue I opened to address this: https://github.com/dotnet/corefxlab/issues/2785.
@praveenraghuvanshi1512 You can use this code with CsvHelper to do a quick file conversion..
private static void Main(string[] args) { //usage prog.exe in.csv out.csv var (inFile, outFile) = (args[0], args[1]); var culture = CultureInfo.InvariantCulture; using var reader = new StreamReader(inFile); using var csvIn = new CsvReader(reader, new CsvConfiguration(culture)); using var recordsIn = new CsvDataReader(csvIn); using var writer = new StreamWriter(outFile); using var outCsv = new CsvWriter(writer, culture); //change this strategy if you prefer to just delete the commas static string Sanitise(string s) => s.Replace(",", "_"); while (recordsIn.Read()) { var columns = recordsIn.FieldCount; for (var i = 0; i < columns; i++) outCsv.WriteField(Sanitise(recordsIn.GetString(i))); outCsv.NextRecord(); } }
Thank you for reference, will try it.
Unable to repro in version 0.20.0-preview.22313.1. Closing for now.
Given the following code:
And data that looks like the snippet below:
If a row contains the separator character, although surrounded in quotes:
The following error is thrown:
This is the source dataset for reference: https://data.boston.gov/dataset/income-restricted-housing