BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0

csv repair and troubleshooting #82

Closed SoftTools59654 closed 6 months ago

SoftTools59654 commented 6 months ago

Is it possible to add a tool that checks the syntax of csv files and fixes those that are problematic?

In most cases a line has too few or too many values, or a value is missing from the line entirely. In data with millions of records, checking this manually is impossible.

The tool could then convert the file to a standard csv file, because defective and faulty files cannot be converted to other formats, such as JSON.
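The check described above (rows whose value count doesn't match the header) is easy to sketch in plain Python. This is only an illustration; find_bad_rows is a hypothetical helper name, not a feature of the plug-in:

```python
import csv
import io

def find_bad_rows(text, delimiter=","):
    """Scan CSV text and report rows whose field count differs from the
    header row. Returns a list of (line_number, field_count) tuples.
    (find_bad_rows is an illustrative helper, not part of CSV Lint.)
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)          # first line defines the expected width
    expected = len(header)
    bad = []
    for lineno, row in enumerate(reader, start=2):
        if len(row) != expected:
            bad.append((lineno, len(row)))
    return bad

sample = "id,name,score\n1,alice,10\n2,bob\n3,carol,7,extra\n"
print(find_bad_rows(sample))  # → [(3, 2), (4, 4)]
```

Here line 3 is one value short and line 4 has one value too many; a real script would then decide what to do with those lines.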

BdR76 commented 6 months ago

It sounds like you're asking about fixing certain data errors in a specific data file, is that correct?

How to fix it depends on your use-case. Should it ignore the column values, replace them with something else, or remove the entire row? What value counts as too low or too high? Should different columns get different conditions? That would require too much customisation and goes beyond adding a new option in this plug-in.

I think the best approach is to write a script (Python or other) to do what you want to do with your data file.

The CSV Lint plug-in can get you started with such a script. Open the csv file in Notepad++ and then go to the menu Plugins > CSV Lint > Generate Metadata and select Python script. This will generate a Python script to read the csv file and write to another csv file. However, you still need to develop and expand that script for your specific data processing/filtering and output file requirements.
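As a rough sketch of what such a standalone read-and-rewrite script could look like (repair_csv, and the pad-then-truncate repair policy, are illustrative assumptions here, not what the Generate Metadata option emits):

```python
import csv

def repair_csv(src_path, dst_path, delimiter=","):
    """Copy a CSV file, forcing every row to the header's width:
    short rows are padded with empty strings, long rows are truncated.
    (repair_csv and this policy are illustrative choices; a real script
    would apply whatever rules fit the data.)
    """
    with open(src_path, newline="", encoding="utf-8") as fin, \
         open(dst_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter=delimiter)
        writer = csv.writer(fout, delimiter=delimiter)
        header = next(reader)
        width = len(header)
        writer.writerow(header)
        for row in reader:
            fixed = (row + [""] * width)[:width]  # pad, then truncate
            writer.writerow(fixed)
```

Once every row has the same width, downstream conversions (JSON, SQL, etc.) no longer choke on ragged lines.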

Btw I can't help you develop a Python script, but you can look up a lot of things on Stack Overflow or ask ChatGPT.

BdR76 commented 6 months ago

Btw for large (>1GB) files, you could look at the pandas library in Python. The read_csv function has a chunksize parameter for processing such large files. I'm not familiar with it, but there is example code here
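A minimal sketch of the chunked approach, assuming pandas is installed (process_in_chunks is a hypothetical function name; the row count stands in for whatever per-chunk cleaning you actually need):

```python
import pandas as pd

def process_in_chunks(path, chunksize=100_000):
    """Stream a large CSV through pandas in fixed-size chunks instead of
    loading the whole file into memory. Each chunk is an ordinary
    DataFrame, so any filtering/cleaning can happen per chunk; here we
    just count rows as a placeholder.
    """
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # real code would clean/filter the chunk and append it to an output file
        total += len(chunk)
    return total
```

With chunksize set, read_csv returns an iterator of DataFrames rather than one DataFrame, which keeps memory use roughly constant regardless of file size.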