Closed RamonUnch closed 1 year ago
I tried to port this patch to my tree:
diff --git a/GreenPad/kilib/textfile.cpp b/GreenPad/kilib/textfile.cpp
index 182e2bf..c04f0ca 100644
--- a/GreenPad/kilib/textfile.cpp
+++ b/GreenPad/kilib/textfile.cpp
@@ -53,17 +53,13 @@ struct rBasicUTF : public ki::TextFileRPimpl
state = EOF;
// 改行が出るまで読む
- unicode *w=buf, *e=buf+siz;
+ unicode *w=buf, *e=buf+siz-1;
while( !Eof() )
{
*w = GetC();
if(BOF && *w!=0xfeff) BOF = false;
- if( *w==L'\r' || *w==L'\n' )
- {
- state = EOL;
- break;
- }
- else if( !BOF && ++w==e )
+
+ if( !BOF && ++w==e )
{
state = EOB;
break;
@@ -71,10 +67,9 @@ struct rBasicUTF : public ki::TextFileRPimpl
if(BOF) BOF = false;
}
- // 改行コードスキップ処理
- if( state == EOL )
- if( *w==L'\r' && !Eof() && PeekC()==L'\n' )
- Skip();
+ // If the end of the buffer contains half a DOS CRLF
+ if( *(w-1)==L'\r' && PeekC() == L'\n' )
+ Skip();
if(BOF) BOF = false;
// 読んだ文字数
@@ -919,28 +914,28 @@ struct rMBCS : public TextFileRPimpl
size_t ReadLine( unicode* buf, ulong siz )
{
// バッファの終端か、ファイルの終端の近い方まで読み込む
- const char *p, *end = Min( fb+siz/2, fe );
+ // Read to the end of the buffer or near the end of the file
+ const char *p, *end = Min( fb+siz/2-2, fe );
state = (end==fe ? EOF : EOB);
- // 改行が出るまで進む
+ // 改行が出るまで進む, Proceed until the line breaks.
for( p=fb; p<end; )
- if( *p=='\r' || *p=='\n' )
- {
- state = EOL;
- break;
- }
#if !defined(TARGET_VER) || (defined(TARGET_VER) && TARGET_VER>350)
- else if( (*p) & 0x80 && p+1<fe )
+ if( (*p) & 0x80 && p+1<fe )
{
p = next(readcp,p,0);
}
-#endif
else
+#endif
{
++p;
}
- // Unicodeへ変換
+ // If the end of the buffer contains half a DOS CRLF
+ if( *(p-1)=='\r' && *(p) =='\n' )
+ ++p;
+
+ // Unicodeへ変換, convertion to Unicode
ulong len;
#ifndef _UNICODE
len = conv( readcp, 0, fb, p-fb, buf, siz );
@@ -954,10 +949,6 @@ struct rMBCS : public TextFileRPimpl
len = ::MultiByteToWideChar( readcp, 0, fb, int(p-fb), buf, siz );
}
#endif
- // 改行コードスキップ処理
- if( state == EOL )
- if( *(p++)=='\r' && p<fe && *p=='\n' )
- ++p;
fb = p;
// 終了
@@ -1130,15 +1121,15 @@ struct rIso2022 : public TextFileRPimpl
len=0;
// バッファの終端か、ファイルの終端の近い方まで読み込む
- const uchar *p, *end = Min( fb+siz/2, fe );
+ const uchar *p, *end = Min( fb+siz/2-2, fe );
state = (end==fe ? EOF : EOB);
// 改行が出るまで進む
for( p=fb; p<end; ++p )
switch( *p )
{
- case '\r':
- case '\n': state = EOL; goto outofloop;
+// case '\r':
+// case '\n': state = EOL; goto outofloop;
case 0x0F: GL = &G[0]; break;
case 0x0E: GL = &G[1]; break;
case 0x8E: gWhat = 2; break;
@@ -1160,10 +1151,9 @@ struct rIso2022 : public TextFileRPimpl
}
outofloop:
- // 改行コードスキップ処理
- if( state == EOL )
- if( *(p++)=='\r' && p<fe && *p=='\n' )
- ++p;
+ // If the end of the buffer contains half a DOS CRLF
+ if( *(p-1)=='\r' && *p=='\n' )
+ ++p;
fb = p;
// 終了
and yeah it does load faster and uses less memory, but it seems to have regressed (ConfigManager
doesn't read type\default.lay
correctly and does not respect the line-number display option ln=1
)
EDIT: because ConfigManager::LoadLayout()
really wants to read line by line.
Thanks for the info.
Indeed layouts need to be read line by line in ConfigManager::LoadLayout()
The simplest would be to make both readline + readMultiLine functions available but I could also modify the LoadLayout function.
Everything seems fine for the kwd files, which are not read the same way. I need to test opening/saving large valid files and check that the result is the same as in the previous GreenPad.
The only difference I could find is with binary files, where invalid sequences have different meanings depending on where you split them. It should not make any change to valid sequences. I will keep testing for some more time before merging.
For now GreenPad reads the file line by line, this is hugely inefficient and forces the use of CreateFileMapping (or it would be crazy slow).
The idea is to use the fact that InsertingOperation() can handle a multi-line buffer, so we can read a big buffer at once, which will be much faster. I already went from 28s to 15s to load a big test file (Latin1).
I just need to test more carefully and to apply the same logic to all ReadLine functions.
The question is: why did k.inaba make this line-by-line reading that is actually more complex and much slower? Am I missing something?
This is also a necessary step to use piping with uconf.exe later, otherwise overhead would be too large.
@roytam1 You might be interested in this patch. I did very little testing; I will try to do some regression testing for this. If you could double-check, it would be great (whenever you have spare time, of course).