davidm / luacom

Microsoft Component Object Model (COM) binding for Lua
http://lua-users.org/wiki/LuaCom

FileSystemObject Unicode filepath truncations #26

Open tatewise opened 2 years ago

tatewise commented 2 years ago

This involves using LuaCOM 1.3 and Lua 5.1 with Microsoft FileSystemObject running in Windows 10. If file paths include Unicode code-points in UTF-8 format then some methods return truncated file paths. e.g.

require("luacom")
fso = luacom.CreateObject("Scripting.FileSystemObject")
strParent = fso:GetParentFolderName("C:\\Root\\ĀĒĪŌŪ Unicode\\Folder")

That should return C:\Root\ĀĒĪŌŪ Unicode but actually returns C:\Root\ĀĒĪŌŪ Un truncated by 5 bytes. It is always truncated by the number of multi-byte UTF-8 code points.

Similar problems affect other methods such as fso:GetFolder(...) and fso:GetFile(...) regarding file path names.

When the same script is used with Lua 5.3 and Windows 10 everything works correctly. Unfortunately, I am forced to use a precompiled Lua 5.1 application. As a check, I ran similar FileSystemObject methods in Windows PowerShell on the same PC and that worked correctly. Another user has the same symptoms on a different PC with Lua 5.1 and Windows 11.

Is there any workaround for this problem?

robertlzj commented 2 years ago

I tested your path on my computer and the result is correct. Windows 10, code page 936, script file in UTF-8, Lua 5.3, LuaCOM 1.4 (I think).

You could also use a Lua pattern:

_,_,strParent = string.find('C:\\Root\\ĀĒĪŌŪ Unicode\\Folder','(.+)\\')
assert(strParent==[[C:\Root\ĀĒĪŌŪ Unicode]])

FSO does not seem to be strict about the path anyway, I think:

strParent = fso:GetParentFolderName([["C:\not exist directory\not exist file"]])
assert(strParent=="\"C:\\not exist directory")
--  strange path, truncated quotation?

tatewise commented 2 years ago

Yes, as I said, everything is OK in Lua 5.3 but is faulty in Lua 5.1. Yes, there are workarounds for GetParentFolderName(...) but not for GetFolder(...) and GetFile(...) and other methods.

robertlzj commented 2 years ago

Sorry, I missed the '5.1'. I tested with 5.1 and see the same issue.

assert(#"C:\\Root\\ĀĒĪŌŪ Unicode"==26 and #'ĀĒĪŌŪ'==10)
strParent = fso:GetParentFolderName("C:\\Root\\ĀĒĪŌŪ Unicode\\Folder")
assert(strParent==[[C:\Root\ĀĒĪŌŪ Un]] and #strParent==26-5)
----
assert(#'C:\\Root\\啊啊啊啊啊 Unicode'==31 and #'啊啊啊啊啊'==15)--3 bytes per character
strParent = fso:GetParentFolderName("C:\\Root\\啊啊啊啊啊 Unicode\\Folder")
assert(strParent=='C:\\Root\\啊啊啊啊\229' and #strParent==31-10)
----
strParent = fso:GetParentFolderName("C:\\Root\\ĀĒĪŌŪ Unicode     \\Folder")--cheat by appending five 1-byte characters (spaces) to absorb the truncation
assert(strParent==[[C:\Root\ĀĒĪŌŪ Unicode]])
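The numbers in these asserts fit one reading: the returned string seems to be cut to as many bytes as the path has UTF-16 code units (21 in both tests). That is only an inference from the two cases above, not something confirmed from the LuaCOM source. A minimal sketch of the arithmetic, where `utf16_units` is a hypothetical helper name and the input is assumed to be well-formed UTF-8:

```lua
-- Count the UTF-16 code units needed for a UTF-8 string.
-- Hypothetical helper; assumes well-formed UTF-8. Runs on Lua 5.1 and later.
local function utf16_units(s)
  local units = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 0x80 or (b >= 0xC0 and b < 0xF0) then
      units = units + 1   -- ASCII, or lead byte of a 2- or 3-byte sequence
    elseif b >= 0xF0 then
      units = units + 2   -- 4-byte sequence: a surrogate pair in UTF-16
    end                   -- 0x80..0xBF are continuation bytes: not counted
  end
  return units
end

-- The two expected parent folders from the tests (script saved as UTF-8).
local latin = "C:\\Root\\ĀĒĪŌŪ Unicode"
local cjk   = "C:\\Root\\啊啊啊啊啊 Unicode"
print(#latin, utf16_units(latin))  --> 26  21  (difference 5)
print(#cjk,   utf16_units(cjk))    --> 31  21  (difference 10)
```

If this reading is right, padding the folder name with `#path - utf16_units(path)` single-byte characters generalizes the trailing-space cheat, though only for methods that tolerate the padding.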

I am only just getting to know FSO, and only from your post. Before that, I used lfs, the command line and patterns to handle tasks on file and directory paths. I also recently found this, which may help: Windows Shell Items: Lua parsing library. It parses binary files directly to get information on various file formats. The documentation is incomplete, but it could work.

robertlzj commented 2 years ago

Hi @tatewise, I have a workaround, but I am not sure whether it has limitations, since I don't know much about code points. It may be just another workaround for GetParentFolderName(...) only 🤣 The key is to break up the code points first, pass the result to FSO, then convert (reassemble) what comes back, like a play on words. Something like this; it is ugly, but it works for your example (GetParentFolderName only):

-- assumes fso = luacom.CreateObject("Scripting.FileSystemObject") as in the first post
path='C:\\Root\\ĀĒĪŌŪ Unicode\\Folder'
print(path)
print(string.byte(path,1,#path))
-- hash part: UTF-8 byte -> placeholder 1..6; array part: placeholder -> UTF-8 byte
map={[196]=1,[146]=2,[170]=3,[197]=4,[140]=5,[128]=6,
    196,146,170,197,140,128
}
-- forward pass: replace every non-ASCII byte with its single-byte placeholder
tem_str_bytes={}
index=1
while index<=#path do
    byte=string.byte(path,index)
    byte=map[byte] or byte
    assert(byte<128,byte) -- every byte must be ASCII at this point
    table.insert(tem_str_bytes,string.char(byte))
    index=index+1
end
tem_str=table.concat(tem_str_bytes)
print(tem_str)
tem_strParent = fso:GetParentFolderName(tem_str)
print(string.byte(tem_strParent,1,#tem_strParent))
print(tem_strParent)
-- reverse pass: turn the placeholders back into the original UTF-8 bytes
tem_str_bytes={}
index=1
while index<=#tem_strParent do
    byte=string.byte(tem_strParent,index)
    byte=map[byte] or byte
    table.insert(tem_str_bytes,string.char(byte))
    index=index+1
end
strParent=table.concat(tem_str_bytes)
print(strParent)
assert(strParent==[[C:\Root\ĀĒĪŌŪ Unicode]])

print output:

C:\Root\ĀĒĪŌŪ Unicode\Folder
67  58  92  82  111 111 116 92  196 128 196 146 196 170 197 140 197 170 32  85  110 105 99  111 100 101 92  70  111 108 100 101 114
C:\Root\ Unicode\Folder
67  58  92  82  111 111 116 92  1   6   1   2   1   3   4   5   4   3   32  85  110 105 99  111 100 101
C:\Root\ Unicode
C:\Root\ĀĒĪŌŪ Unicode

tatewise commented 2 years ago

There are probably many workarounds just for GetParentFolderName(...) but they do not work for GetFolder(...) or GetFile(...) or other methods where they must interact with actual folders or files. Your suggestion does not work in Lua 5.3 for those other methods let alone in Lua 5.1 ☹

robertlzj commented 2 years ago

OK, I wish you good luck. Just to mention it again: if they can be substituted, some of the Folder object and File object functions of FSO (I only took a glance) could be implemented with LuaFileSystem (lfs), or with a command line invoked through io.popen, etc. I have tried some of them in Lua 5.3. There may also be forks of luacom; I hope you don't miss them. 😃 File object | Microsoft Docs

tatewise commented 2 years ago

Unfortunately, lfs and io.popen only support file paths using the 256 ANSI character set and do NOT support file paths containing any UTF-8 characters such as Ā Ē Ī Ō Ū, etc. I know lfs and the io library very well and used them until switching to luacom FileSystemObjects to handle UTF-8 file paths, but then ran into this issue when using Lua 5.1.

robertlzj commented 2 years ago

Wait, I use them in a GBK (CP 936) system environment, which can handle many non-ANSI characters too. In my experience, you need to convert from UTF-8 to GBK (my system code page), and then lfs works! I have tried this a lot in Lua 5.3, but I am not sure about Lua 5.1.

local lfs=require'lfs'
local gbk=require'gbk'
a=lfs.attributes(gbk.fromutf8[[C:\Ā Ē Ī Ō Ū]])--from utf8 to gbk
assert(a)

The same should go for all the io functions! 😁

I must have asked for help with similar questions many times, maybe on the lfs issue page 😂 Until you mentioned the UTF problems with lfs, I had almost forgotten about it, because I wrapped gbk+lfs in a package, so I never notice the conversion.

Another, older method: save the script, or just the path argument, in ANSI encoding, then pass it to lfs. That works too! The garbled characters are 'ĀĒĪŌŪ'; code in Lua 5.1 (image)

So it seems that lfs works in the system code page, which basically also contains ANSI, but is not suitable for UTF-8, the encoding of the script file? Nice summary 🤣

tatewise commented 2 years ago

I need to be able to support all Unicode UTF-8 code points and not just a subset. It must also work on any other user system that I cannot control because my script is published for any user to download.

robertlzj commented 2 years ago

Oh, then try iconv, which converts between various encodings. There is a Lua binding for Windows, but it needs to be compiled (and I am not familiar with that), so I have not tried it.

Or you could convert from UTF-8 to the local ANSI code page, the second method above. Would this be easier? I have not thought it through. A conversion may still be needed, because UTF-8 is an encoding (for transfer) of Unicode, so the text has to be converted from the local character set first... Tested on Lua 5.1:

local lfs=require'lfs'
a=lfs.attributes('\168\161 \168\165 \168\169 \168\173 \168\177')
--  ANSI (using local encode when beyond ASCII?): Ā Ē Ī Ō Ū, equal the mess code in the picture above
assert(a)

See this, which mentions PowerShell and iconv (the command line tool). It covers file conversion (I used Save As above); maybe string conversion too? A hard workaround, maybe.

I have probably made many misstatements about UTF, Unicode and character sets; I lack the relevant knowledge for now. Noted, and learning.

robertlzj commented 2 years ago

Hi, there is another solution, utf8_filenames.lua. It is not ideal for me, since I use the 'gbk' conversion, but you could try it.

tatewise commented 2 years ago

Unfortunately, that is NOT a general solution for arbitrary UTF-8 symbols because as its comments say: -- Please note that filenames must contain only symbols from your Windows ANSI codepage (which depends on OS locale). -- Unfortunately, it's impossible to work with a file having arbitrary UTF-8 symbols in its name. In other words, all it does is convert UTF-8 to ANSI for the 256 characters in the locale Code Page.

robertlzj commented 2 years ago

Yes, I see, it is very limited 😂 Although GBK conversion is enough for me, I am still searching for a general solution too. iconv seems to be the best solution I can find for now.

robertlzj commented 2 years ago

Hi, I have built and tested lua-iconv (based on [libiconv - GNU Project - Free Software Foundation (FSF)](http://www.gnu.org/software/libiconv/)) on Windows 10 with Lua 5.3. It works fine; you could give it a try.

1linux commented 1 year ago

We ran into a similar problem: customers created file paths containing UTF-16 characters on a Windows machine. You can even get the path into a UTF-8 string, which Lua can handle with no problem. However, the limiting factor is the C runtime library (msvcrt): through its narrow-character functions you simply cannot access files whose filenames are encoded in UTF. On Windows.

Our solution was to write a Lua library. One function is like:

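#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <windows.h>   /* CP_UTF8, CP_ACP */
#include "lua.h"
#include "lauxlib.h"

/* to_utf16() is a helper that is not shown here. Judging by its use below, it
   allocates the wide string (released with free()) and returns 0 on failure. */
int file_get_contents(lua_State *L, const char*filename, int offset, int maxlen, int encoding) {
    luaL_Buffer luabuffer;
    unsigned char buff[1024];
    int wsz=0;
    wchar_t *winfilename=NULL;
    FILE *pf=NULL;
    if(encoding>0) {
        wsz=to_utf16(filename,encoding,&winfilename);
    } else {
        /* no encoding given: try UTF-8 first, then the ANSI code page */
        wsz = to_utf16(filename,CP_UTF8,&winfilename) || to_utf16(filename,CP_ACP,&winfilename);
    }
    if(!wsz) {
        lua_pushnil(L);
        lua_pushstring(L, "convert to windows utf-16 filename fail");
        return 2;
    }
    /* the wide-character open is what makes non-ANSI names reachable */
    pf = _wfopen(winfilename, L"rb");
    free(winfilename);
    if (pf == NULL) {
        lua_pushnil(L);
        lua_pushstring(L, strerror(errno));
        return 2;
    }
    if(maxlen<=0) {
        /* no length given: read from offset to the end of the file */
        fseek (pf, 0, SEEK_END);
        maxlen=ftell(pf) - offset;
        if(maxlen<0) maxlen=0;
    }
    if(maxlen<=0) {
        fclose(pf);
        lua_pushstring(L, "");
        return 1;
    }
    luaL_buffinit(L, &luabuffer);
    fseek(pf,offset,SEEK_SET);
    while(maxlen>0) {
        size_t want = maxlen > (int)sizeof(buff) ? sizeof(buff) : (size_t)maxlen;
        size_t got = fread(buff,1,want,pf);
        if(got==0) break; /* EOF or read error: do not append stale buffer data */
        luaL_addlstring(&luabuffer, (const char*)buff, got);
        maxlen -= (int)got;
    }
    luaL_pushresult(&luabuffer);
    fclose(pf);
    return 1;
}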
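
The `to_utf16` helper is not shown in the post. On Windows a helper like it would presumably wrap `MultiByteToWideChar`, but that is an assumption. For UTF-8 input, the conversion step itself can be sketched in plain Lua; `utf8_to_utf16` is a hypothetical name, the input is assumed to be well-formed UTF-8, and the result is a table of code unit numbers rather than a wide string a C API could take directly:

```lua
-- Decode well-formed UTF-8 into a list of UTF-16 code units (numbers).
-- Hypothetical sketch; no validation of malformed input. Lua 5.1 and later.
local function utf8_to_utf16(s)
  local units, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local cp, n
    if b < 0x80 then cp, n = b, 1             -- ASCII
    elseif b < 0xE0 then cp, n = b - 0xC0, 2  -- 2-byte sequence
    elseif b < 0xF0 then cp, n = b - 0xE0, 3  -- 3-byte sequence
    else cp, n = b - 0xF0, 4 end              -- 4-byte sequence
    -- each continuation byte contributes its low 6 bits
    for j = i + 1, i + n - 1 do
      cp = cp * 64 + (s:byte(j) - 0x80)
    end
    i = i + n
    if cp < 0x10000 then
      units[#units + 1] = cp
    else
      -- code points above the BMP become a surrogate pair
      cp = cp - 0x10000
      units[#units + 1] = 0xD800 + math.floor(cp / 0x400)
      units[#units + 1] = 0xDC00 + cp % 0x400
    end
  end
  return units
end

-- Usage: print the code units of a short name as hex.
local hex = {}
for _, unit in ipairs(utf8_to_utf16("Ā啊")) do
  hex[#hex + 1] = string.format("%04X", unit)
end
print(table.concat(hex, " "))  --> 0100 554A
```

A pure-Lua conversion like this does not by itself open a file; the wide-character call (`_wfopen` above) still has to happen on the C side.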