BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0
151 stars 8 forks source link

Column Separator Detection Inconsistent #55

Closed SpiderManAPV closed 1 year ago

SpiderManAPV commented 1 year ago

I need to edit a CSV file report that gets generated weekly. This report always has the format laid out the exact same every single time. About 75-80% of the time CSV lint will pick up the formatting without problem. However, once every month or two it will be unable to detect the column separator despite the header row being identical and the data nearly identical. I can't identify any patterns tot distinguish why it picks it up sometimes and why other times it won't.

If it would be useful I'd be happy to include a copy of the file where it does and another where it doesn't pick up the column separator so you can try to figure out where it's getting stuck. I'll tell you now, however, that the header row is always the exact same.


Time,Caller Name,Caller Number,Callee Name,Callee Number,DOD,DID,Call Duration,Talk Duration,Status,Source Trunk,Destination Trunk,Communication Type,PIN Code,Caller IP Address,Recordfile
SpiderManAPV commented 1 year ago

Also, for what it might be worth, the columns do work correctly if I open the CSV lint panel in one of the files that does work then copy/paste the scheme over. Not sure if that matters at all.

BdR76 commented 1 year ago

Thanks for describing your issue, it sounds interesting, probably some edge case in the data. Yes I am interested in seeing the file to see if the plugin can be fixed. You can send the non-working file to my e-mail (see at bottom main github page) I wouldn't recommend posting your data publicly here on github.

Also, recently I fixed a bug in the csv detection part, see this commit. Does your file contain any CrLf's, so quoted strings with new lines/carriage returns in them? If so, then maybe it's already fixed when you use the "beta" build DLL of the plugin?

BdR76 commented 1 year ago

Without knowing what's in the data file, I can't reproduce your error. You can find my e-mail at the bottom of the page here.

I'll close this issue for now. When I receive an example file then this one can be re-opened.

SpiderManAPV commented 1 year ago

Sorry for the delay. Email sent with example files.

BdR76 commented 1 year ago

I received the data through mail, thanks. In the file that isn't properly detected, it looks like some values in the second column contain a comma , but they aren't quoted. That messes up the auto-detection.

I won't repeat your data here, but for example your data is something like this

Time,Name,Status
2023-04-12 17:18:15,DOUGLAS,Completed
2023-04-12 17:20:44,FLORES,DE,Completed
2023-04-12 17:21:51,MURPHY,Completed

But it should be like this, with quotes

Time,Name,Status
2023-04-12 17:18:15,DOUGLAS,Completed
2023-04-12 17:20:44,"FLORES,DE",Completed
2023-04-12 17:21:51,MURPHY,Completed

Either the second column should have quotes, or it should use a different separator, that would also avoid this issue. So using tabs or semicolons instead of comma.

BdR76 commented 1 year ago

There already were some updates in the latest development release of the plug-in, and that particular file will be correctly detected automatically. However, the problematic lines in the file will never be parsed correctly (by any app or program) because the extra comma causes all subsequent values to be "shifted" one column over to the right.

In the current version of the plugin, you can go to Plugins > CSV Lint > CSV Lint window, uncheck the "auto-matic" checkbox, press "Detect columns" and then manually select comma , as separator and press OK. If you then press "Validate data" it will display validation errors "error line 5: Too many columns, error line 7: Too many columns, error line 11.." etc. for every line where the name contains a extra comma ,. Btw also notice how the column colors at the end of those lines are different, that's an indication that all values have shifted to right.

SpiderManAPV commented 1 year ago

That makes complete sense, but unfortunately if I was able to customize the data that was being exported from the particular program my primary need for CSVLint would actually be gone. One of my main use cases for it is so I can skim the exported data and tell by the color if something got put in the wrong column and just delete the extra comma so things render properly in Excel. The extra commas breaking CSVLint is a cruel irony for me there lol.

BdR76 commented 1 year ago

If the only issue with the file is the incorrect commas in the second column, then I would just pre-process all files using a Python script, instead of manually removing the commas. So something like:

# repare incorrect csv file
# check if too many column (=2nd column contains unquoted comma)
import csv

input_file = "Not Working File.csv"
output_file = "Not Working File(2).csv"

# prepare writing to the new output file
with open(output_file, 'w+', newline ='') as outfile:
    # prepare reading from the input file
    with open(input_file) as file:
        newcsv = csv.writer(outfile)
        for line in file:
            data = line.strip().split(',')
            # expected 16 columns, if more then fix it
            if len(data) > 16:
                # merge 2nd and 3rd value and remove 3rd value
                value3 = data.pop(2) # 3rd; zero based
                data[1] = (data[1] + ',' + value3)
            newcsv.writerow(data)