BdR76 / CSVLint

CSV Lint plug-in for Notepad++: syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing the datetime format or decimal separator, sorting data, counting unique values, converting to XML, JSON, SQL, etc. A plugin for data cleaning and working with messy data files.

Conversion to JSON is extremely slow #70

Closed: molsonkiko closed this issue 7 months ago

molsonkiko commented 1 year ago

I tried converting a 10MB CSV file to JSON. A preview:

CITY,TEMP,DATE
SEATTLE,3.1,19610101
SEATTLE,0.55,19610102
SEATTLE,0,19610103
SEATTLE,4.45,19610104
SEATTLE,8.35,19610105

Essentially, one column is parsed as strings (maybe 8-15 characters each), one as floats, and one as integers.

Conversion to JSON took over a minute, of which probably about 5 seconds is attributable to Scintilla's JSON lexer (based on a comparison with PythonScript using the csv and json modules). I am pretty sure the culprit is this block of code:

// JSON escape characters
colvalue = colvalue.Replace("\\", "\\\\"); // \\ Backslash character
colvalue = colvalue.Replace("\b", "\\b");  // \b Backspace (ascii code 08)
colvalue = colvalue.Replace("\f", "\\f");  // \f Form feed (ascii code 0C)
colvalue = colvalue.Replace("\n", "\\n");  // \n New line
colvalue = colvalue.Replace("\r", "\\r");  // \r Carriage return
colvalue = colvalue.Replace("\t", "\\t");  // \t Tab
colvalue = colvalue.Replace("\"", "\\\""); // \" Double quote

// put value in double quotes
colvalue = string.Format("\"{0}\"", colvalue);

This code copies colvalue 8 times, and then a 9th time when you finally append it to sb, which in turn puts a lot of pressure on the garbage collector. I've achieved significant performance improvements in JsonTools by removing such unnecessary copying of memory.

I recommend doing something like this (based on the string conversion method in JsonTools):

sb.Append('"');
foreach (char c in colvalue)
{
    switch (c)
    {
        case '\\':   sb.Append("\\\\"   );
        case '"':    sb.Append("\\\""   );
        case '\x01': sb.Append("\\u0001");
        case '\x02': sb.Append("\\u0002");
        case '\x03': sb.Append("\\u0003");
        case '\x04': sb.Append("\\u0004");
        case '\x05': sb.Append("\\u0005");
        case '\x06': sb.Append("\\u0006");
        case '\x07': sb.Append("\\u0007");
        case '\x08': sb.Append("\\b"    );
        case '\x09': sb.Append("\\t"    );
        case '\x0A': sb.Append("\\n"    );
        case '\x0B': sb.Append("\\v"    );
        case '\x0C': sb.Append("\\f"    );
        case '\x0D': sb.Append("\\r"    );
        case '\x0E': sb.Append("\\u000E");
        case '\x0F': sb.Append("\\u000F");
        case '\x10': sb.Append("\\u0010");
        case '\x11': sb.Append("\\u0011");
        case '\x12': sb.Append("\\u0012");
        case '\x13': sb.Append("\\u0013");
        case '\x14': sb.Append("\\u0014");
        case '\x15': sb.Append("\\u0015");
        case '\x16': sb.Append("\\u0016");
        case '\x17': sb.Append("\\u0017");
        case '\x18': sb.Append("\\u0018");
        case '\x19': sb.Append("\\u0019");
        case '\x1A': sb.Append("\\u001A");
        case '\x1B': sb.Append("\\u001B");
        case '\x1C': sb.Append("\\u001C");
        case '\x1D': sb.Append("\\u001D");
        case '\x1E': sb.Append("\\u001E");
        case '\x1F': sb.Append("\\u001F");
        default:     sb.Append(c        );
    }
}
sb.Append('"');

I can't guarantee that this will fix all your problems (there's a lot of other unnecessary memory copying here), but it will probably reduce runtime by at least 20%.

molsonkiko commented 1 year ago

One other thought: at the end of the function, you should probably check whether the JSON string is over a certain length (say 50 MB) and only turn on the JSON lexer if it's under that length. The JSON lexer is pretty slow, and JsonTools leaves it off when pretty-printing big files.
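
For illustration, a minimal sketch of that check, assuming the converted text ends up in a StringBuilder named sb. ApplyJsonLexer() is a hypothetical placeholder for whatever call CSVLint actually uses to switch the new document to the JSON lexer, and the 50 MB threshold is just an example:

const int JSON_LEXER_LIMIT = 50 * 1024 * 1024; // ~50 MB, illustrative threshold

string json = sb.ToString();
// ... create the new document and insert `json` into it as usual ...

// only enable the (slow) JSON lexer for reasonably small output;
// Length is a character count, which is a good-enough proxy for size here
if (json.Length <= JSON_LEXER_LIMIT)
{
    ApplyJsonLexer(); // hypothetical helper; the real call that sets the buffer language may differ
}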

BdR76 commented 10 months ago

Sorry for the late reply. The menu options Convert Data > JSON (and also XML) are indeed very slow; I never really noticed because I don't use them as much.

The code with the unnecessary memcopy, while ugly, doesn't contribute to the long processing time as far as I can see; it's mainly the syntax highlighting applied at the end. So I've added a new setting, AutoSyntaxLimit, which defaults to 1 MB.

I agree that your code is better than using multiple string.Replace() calls, but I'll have to do some more testing with large files.

BdR76 commented 7 months ago

The new setting AutoSyntaxLimit is available in the latest release.