BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0

CSVLint causes Notepad++ crash when attempting to open file with more than 2**31 - 1 bytes #93

Open molsonkiko opened 1 month ago

molsonkiko commented 1 month ago

To replicate

  1. Open any file with more than 2**31 - 1 bytes (2**31 - 1 is the largest value of a C# int and the upper bound on string.Length, hereafter referred to as int.MaxValue).
  2. Notepad++ will crash.
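If no suitable file is at hand, one can be generated. The following is a minimal sketch, not part of CSVLint: the class name, the column names, the row contents, and the output path `too_big.csv` are all made up for illustration, and every data row is identical, so it only exercises the size limit and nothing else.

```csharp
using System;
using System.IO;
using System.Text;

public static class BigCsvGenerator
{
    // Writes a header plus repeated identical rows until at least minBytes
    // bytes have been written. Returns the number of bytes written.
    public static long WriteCsvOfAtLeast(string path, long minBytes)
    {
        byte[] header = Encoding.ASCII.GetBytes("id,name,value\r\n");
        byte[] row = Encoding.ASCII.GetBytes("1,example,3.14\r\n");
        // 1 MiB buffer so the loop does not hit the disk on every row.
        using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None, 1 << 20))
        {
            stream.Write(header, 0, header.Length);
            long written = header.Length;
            while (written < minBytes)
            {
                stream.Write(row, 0, row.Length);
                written += row.Length;
            }
            return written;
        }
    }

    public static void Main()
    {
        // Hypothetical output path; needs a bit more than 2 GB of free disk space.
        long size = WriteCsvOfAtLeast("too_big.csv", (long)int.MaxValue + 1);
        Console.WriteLine(size > int.MaxValue); // True
    }
}
```

Passing `(long)int.MaxValue + 1` as the minimum guarantees the file is over the limit; the final size lands slightly above that because rows are written whole.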

Expected behavior

Obviously I expect no crash. I would also expect CSVLint to explain to the user that the file is too large for CSVLint to open whenever they manually run a plugin command.

Debug info

Notepad++ v8.6.9   (64-bit)
Build time : Jul 12 2024 - 05:09:25
Path : C:\Program Files\Notepad++\notepad++.exe
Command Line : 
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
Periodic Backup : ON
OS Name : Windows 10 Home (64-bit)
OS Version : 22H2
OS Build : 19045.4651
Current ANSI codepage : 1252
Plugins : 
    ColumnsPlusPlus (1.1.1)
    ComparePlus (1.1)
    CSharpPluginPack (0.0.3.9)
    CSVLint (0.4.6.6)
    EnhanceAnyLexer (1.1.3)
    HTMLTag (1.4.3.1)
    HugeFiles (0.4.1)
    JsonTools (8.0.0.17)
    mimeTools (3.1)
    NavigateTo (2.7)
    NppConverter (4.6)
    NppExport (0.4)
    NppLspClient (0.0.21)
    PythonScript (3.0.16)
    XMLTools (3.1.1.13)

Proposed solution

The sneaky problem with the plugin infrastructure we use is that you can't even attempt to get the length of a file longer than int.MaxValue bytes as long as the IScintillaGateway.GetLength() method returns an int. The obvious solution is to have the method return a long, but bounds-checking that long at every call site is annoying, so the best solution is a helper method that does the check once.
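A minimal sketch of why an int-returning length getter is a problem, assuming a 64-bit process. How CSVLint's current GetLength() actually narrows the IntPtr returned by SendMessage is not shown in this issue, so `TryNarrow` and the 3 GB length below are hypothetical stand-ins. What the sketch relies on is documented .NET behavior: `IntPtr.ToInt32()` throws an OverflowException on 64-bit when the value does not fit, and an unchecked cast silently wraps around.

```csharp
using System;

public static class LengthOverflowDemo
{
    // Narrow a native-sized result to int, as any int-returning getter must.
    // Returns null when the value does not fit in an int.
    public static int? TryNarrow(IntPtr nativeResult)
    {
        try
        {
            // On 64-bit, throws OverflowException if the value is out of int range.
            return nativeResult.ToInt32();
        }
        catch (OverflowException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // Hypothetical document length of 3,000,000,000 bytes.
        // Assumes a 64-bit process: new IntPtr(long) itself throws on 32-bit for this value.
        IntPtr big = new IntPtr(3000000000L);

        Console.WriteLine(big.ToInt64());                 // 3000000000
        Console.WriteLine(TryNarrow(big).HasValue);       // False
        Console.WriteLine(unchecked((int)big.ToInt64())); // -1294967296
    }
}
```

Either outcome is bad for a caller that expects a valid length: an exception that nobody catches, or a negative number. Reading the value with `ToInt64()` avoids both.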

I'm not going to submit a PR because I don't have the .NET Framework 4.0 targeting pack installed and I don't feel like installing it, but you can see how I changed JsonTools to fix this issue.

In short, you want to do the following:

  1. Change ScintillaGateway and IScintillaGateway so that the GetLength() method returns a long:
    public long GetLength()
    {
        return Win32.SendMessage(scintilla, SciMsg.SCI_GETLENGTH, (IntPtr)Unused, (IntPtr)Unused).ToInt64();
    }
  2. Add a helper method (and a related static field) to your Main class that does the bounds checking and (optionally) warns the user if the file is too big:

    private static bool stopShowingFileTooLongNotifications = false;
    /// <summary>
    /// if <see cref="IScintillaGateway.GetLength"/> returns a number greater than <see cref="int.MaxValue"/>, return false and set len to -1.<br></br>
    /// Otherwise, return true and set len to the length of the document.<br></br>
    /// If showMessageOnFail, show a message box warning the user that the command could not be executed.
    /// </summary>
    public static bool TryGetLengthAsInt(out int len, bool showMessageOnFail = true)
    {
        long result = editor.GetLength();
        if (result > int.MaxValue)
        {
            len = -1;
            if (!stopShowingFileTooLongNotifications && showMessageOnFail)
            {
                stopShowingFileTooLongNotifications = MessageBox.Show(
                    "CSVLint cannot perform this plugin command on a file with more than 2147483647 bytes.\r\nDo you want to stop showing notifications when a file is too long?",
                    "File too long for CSVLint",
                    MessageBoxButtons.YesNo, MessageBoxIcon.Warning) == DialogResult.Yes;
    
            }
            return false;
        }
        len = (int)result;
        return true;
    }
  3. Refactor methods that look like this:
    // This is the *BAD* old version
    public static void MyMethod()
    {
        int len = editor.GetLength();
        // do something with len
    }
  4. Rewrite them as methods that look like this:
    // This is the *GOOD* new version
    public static void MyMethod()
    {
        if (!Main.TryGetLengthAsInt(out int len))
            return;
        // do something with len
    }
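
The pattern in steps 2 to 4 can be tried outside Notepad++ with a stand-in editor. This is only a sketch: `IFakeGateway`, `FakeGateway`, `LengthGuardDemo`, and `CountBytesCommand` are hypothetical names that model nothing but GetLength(), and the message box is left out so that it runs as a console program.

```csharp
using System;

// Hypothetical stand-in for IScintillaGateway; only GetLength() is modeled.
public interface IFakeGateway
{
    long GetLength();
}

public sealed class FakeGateway : IFakeGateway
{
    private readonly long length;
    public FakeGateway(long length) { this.length = length; }
    public long GetLength() { return length; }
}

public static class LengthGuardDemo
{
    // Same bounds check as TryGetLengthAsInt above, minus the message box.
    public static bool TryGetLengthAsInt(IFakeGateway editor, out int len)
    {
        long result = editor.GetLength();
        if (result > int.MaxValue)
        {
            len = -1;
            return false;
        }
        len = (int)result;
        return true;
    }

    // A command written in the new style: return early on oversized documents.
    public static string CountBytesCommand(IFakeGateway editor)
    {
        if (!TryGetLengthAsInt(editor, out int len))
            return "skipped: document too large";
        return "processed " + len + " bytes";
    }

    public static void Main()
    {
        Console.WriteLine(CountBytesCommand(new FakeGateway(1024)));
        // prints "processed 1024 bytes"
        Console.WriteLine(CountBytesCommand(new FakeGateway((long)int.MaxValue + 1)));
        // prints "skipped: document too large"
    }
}
```

Because the check lives in one helper, each command needs only the two-line guard at the top, and a document of exactly int.MaxValue bytes is still accepted.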

BdR76 commented 1 week ago

Thanks for posting the issue. I think Notepad++ itself runs into other issues as well when opening files larger than 2 GB.

I'll look into this when I have the time. For now, I've updated the generate_data.py script so that it can generate a 2 GB cardio.txt file if you change the TOTAL_LINES variable.