emscripten-core / emscripten

Emscripten: An LLVM-to-WebAssembly Compiler
Other
25.8k stars 3.31k forks source link

stat has been polluted, emscripten using it's own define but user program using system lib define #20840

Open oatgnauh opened 11 months ago

oatgnauh commented 11 months ago

Version of emscripten/emsdk: 3.1.34 I am writting a function to check the given path is a file, here is code:

bool isFile(const std::string& filename) {
    struct stat statbuf;
    auto ret = lstat(filename.c_str(), &statbuf);

    VELOGI("buf:%p, exists:%d, buf.st_mode:%d, buf.st_size:%lld,sizeptr:%p, diff:%d, st_dev:%d, st_ino:%d, st_mode:%d", 
    &statbuf, ret, statbuf.st_mode, statbuf.st_size, &(statbuf.st_size), offsetof(struct stat, st_size), offsetof(struct stat, st_dev), offsetof(struct stat, st_ino), offsetof(struct stat, st_mode))
    if(ret == 0)
        return S_ISREG(statbuf.st_mode) != 0;

    return false;
}

then I find out that given a file path, return false, pretty odd. Thus I print addr of stat, both my code and emscripten inside is the same, check the image below: image image

which show different offset of the same member. and I alse find out that accorrding to the offset of st_mode and st_ino, emscripten is using it's own stat definition instead of <sys/stat.h>, check below: image

Thus, st_mode and st_size is correct in emscripten print, but wrong at my code. you can see stat defined in system lib /usr/include/x86_64-linux-gnu/bits/stat.h

struct stat
  {
    __dev_t st_dev;     /* Device.  */
#ifndef __x86_64__
    unsigned short int __pad1;
#endif
#if defined __x86_64__ || !defined __USE_FILE_OFFSET64
    __ino_t st_ino;     /* File serial number.  */
#else
    __ino_t __st_ino;           /* 32bit file serial number.    */
#endif
#ifndef __x86_64__
    __mode_t st_mode;           /* File mode.  */
    __nlink_t st_nlink;         /* Link count.  */
#else
    __nlink_t st_nlink;     /* Link count.  */
    __mode_t st_mode;       /* File mode.  */
#endif
    __uid_t st_uid;     /* User ID of the file's owner. */
    __gid_t st_gid;     /* Group ID of the file's group.*/
#ifdef __x86_64__
    int __pad0;
#endif
    __dev_t st_rdev;        /* Device number, if device.  */
#ifndef __x86_64__
    unsigned short int __pad2;
#endif
#if defined __x86_64__ || !defined __USE_FILE_OFFSET64
    __off_t st_size;            /* Size of file, in bytes.  */
#else
    __off64_t st_size;          /* Size of file, in bytes.  */
#endif
    __blksize_t st_blksize; /* Optimal block size for I/O.  */
#if defined __x86_64__  || !defined __USE_FILE_OFFSET64
    __blkcnt_t st_blocks;       /* Number 512-byte blocks allocated. */
#else
    __blkcnt64_t st_blocks;     /* Number 512-byte blocks allocated. */
#endif
#ifdef __USE_XOPEN2K8
    /* Nanosecond resolution timestamps are stored in a format
       equivalent to 'struct timespec'.  This is the type used
       whenever possible but the Unix namespace rules do not allow the
       identifier 'timespec' to appear in the <sys/stat.h> header.
       Therefore we have to handle the use of this header in strictly
       standard-compliant sources special.  */
    struct timespec st_atim;        /* Time of last access.  */
    struct timespec st_mtim;        /* Time of last modification.  */
    struct timespec st_ctim;        /* Time of last status change.  */
# define st_atime st_atim.tv_sec    /* Backward compatibility.  */
# define st_mtime st_mtim.tv_sec
# define st_ctime st_ctim.tv_sec
#else
    __time_t st_atime;          /* Time of last access.  */
    __syscall_ulong_t st_atimensec; /* Nscecs of last access.  */
    __time_t st_mtime;          /* Time of last modification.  */
    __syscall_ulong_t st_mtimensec; /* Nsecs of last modification.  */
    __time_t st_ctime;          /* Time of last status change.  */
    __syscall_ulong_t st_ctimensec; /* Nsecs of last status change.  */
#endif
#ifdef __x86_64__
    __syscall_slong_t __glibc_reserved[3];
#else
# ifndef __USE_FILE_OFFSET64
    unsigned long int __glibc_reserved4;
    unsigned long int __glibc_reserved5;
# else
    __ino64_t st_ino;           /* File serial number.  */
# endif
#endif
  };

How can I fix it? I try to using

#include<bits/alltypes.h>
#include <bits/stat.h>

but comes with some compile error image I don't think it is good idea to solve it. pls help~

xehp commented 11 months ago

If host is 64 bit so there that stat struct probably uses 64 bit variables while perhaps emscripten use 32 bit (or the other way around). Why the emscripten version has padding I can not say but that seems to be the case.

Perhaps do a hexdump of the entire struct to see what that looks like.

One can experiment with printing the addresses. That will tell where they are expected (although not where they actually are). printf("sb.st_mode %x %p %p\n", sb.st_mode, &sb, &sb.st_mode);

When I try it this, tells me that in one st_mode is expected at offset 0xC and at offset 0x18 in the other.

To be more precise you may need to define a guest struct for stat and copy the content data member by member:

So these members:

struct translated_stat { dev_t st_dev // Device ID of device containing file. (perhaps padding here or not) ino_t st_ino // File serial number. mode_t st_mode // Mode of file (see below).

and so on.

Copied one by one.

host.st_dev = guest.st_dev; host.st_ino = guest.st_ino; host.st_mode = guest.st_mode;

and so on.

Or copy it (memcpy style) to the translate_struct and then data appears as aligned. Or I totally missed what the question was.

sbc100 commented 11 months ago

Indeed emscripten defines all its own types such as struct stat.

System headers such as /usr/include/x86_64-linux-gnu/bits/stat.h, just like system libraries, cannot be used with emscripten. Those are designed for the host system, and won't work with emscripten or any other cross compiler.

oatgnauh commented 11 months ago

Indeed emscripten defines all its own types such as struct stat.

System headers such as /usr/include/x86_64-linux-gnu/bits/stat.h, just like system libraries, cannot be used with emscripten. Those are designed for the host system, and won't work with emscripten or any other cross compiler.

so, If I have to use such system libraries, the inlcuding headers must be replace <sys/XX.h> with <bits/XX.h> ? otherwise I have to transfer emscripten's define to system define, just like:

struct stat emsdk;
lstat(filename.c_str(), &emsdk);

struct stat system;
system.st_dev = emsdk.st_dev;
system.st_mode = emsdk.st_mode;
...

This seems very inelegant

sbc100 commented 11 months ago

I'm not sure what you are asking but I think the answer is no. Both your system compiler and emscripten contains defintions of struct stat and the lstat function. In both cases you should include <sys/stat.h> I believe. But emscripten's version of this header is specific to emscripten and your OS version is specific to your OS. You can't mix them in single program.

You can see both stat and lstat defined in emscripten here: https://github.com/emscripten-core/emscripten/blob/e57438135ed9f8ded5f01cea124d98156bb29c7f/system/lib/libc/musl/include/sys/stat.h#L73-L75

oatgnauh commented 11 months ago

I'm not sure what you are asking but I think the answer is no. Both your system compiler and emscripten contains defintions of struct stat and the lstat function. In both cases you should include <sys/stat.h> I believe. But emscripten's version of this header is specific to emscripten and your OS version is specific to your OS. You can't mix them in single program.

You can see both stat and lstat defined in emscripten here:

https://github.com/emscripten-core/emscripten/blob/e57438135ed9f8ded5f01cea124d98156bb29c7f/system/lib/libc/musl/include/sys/stat.h#L73-L75

sorry for border you, I should make my question clear. my problem is: When I gave a legal path to lstat, I expected stat.st_size to be greater than zero, but in fact it is not. And when I printed other variables in the stat structure,

printf("buf:%p, exists:%d, buf.st_mode:%d, buf.st_size:%lld,sizeptr:%p, diff:%d, st_dev:%d, st_ino:%d, st_mode:%d", 
    &statbuf, ret, statbuf.st_mode, statbuf.st_size, &(statbuf.st_size), offsetof(struct stat, st_size), offsetof(struct stat, st_dev), offsetof(struct stat, st_ino), offsetof(struct stat, st_mode))

I found that the values of the same members were printed inconsistently in the user code and emsdk's stat implementation. (you can check the print result on my first post img.

To put it simply, I cannot use the stat function normally. Because I saw that the distribution of stat members inside emscripten is inconsistent with the system library, I suspect that there is a problem with the data conversion here.

I wonder inside __syscall_newfstatat,

int __syscall_newfstatat(int dirfd, intptr_t path, intptr_t buf, int flags) {
  // Only accept valid flags.
  if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW)) {
    // TODO: Test this case.
    return -EINVAL;
  }
  auto parsed = path::getFileAt(dirfd, (char*)path, flags);
  if (auto err = parsed.getError()) {
    return err;
  }
  auto file = parsed.getFile();

  // Extract the information from the file.
  auto lockedFile = file->locked();
  auto buffer = (struct stat*)buf;

  off_t size = lockedFile.getSize();
  if (size < 0) {
    return size;
  }
  buffer->st_size = size;

  // ATTN: hard-coded constant values are copied from the existing JS file
  // system. Specific values were chosen to match existing library_fs.js
  // values.
  // ID of device containing file: Hardcode 1 for now, no meaning at the
  // moment for Emscripten.
  buffer->st_dev = 1;
  buffer->st_mode = lockedFile.getMode();
  buffer->st_ino = file->getIno();
  // The number of hard links is 1 since they are unsupported.
  buffer->st_nlink = 1;
  buffer->st_uid = 0;
  buffer->st_gid = 0;
  // Device ID (if special file) No meaning right now for Emscripten.
  buffer->st_rdev = 0;
  // The syscall docs state this is hardcoded to # of 512 byte blocks.
  buffer->st_blocks = (buffer->st_size + 511) / 512;
  // Specifies the preferred blocksize for efficient disk I/O.
  buffer->st_blksize = 4096;
  buffer->st_atim.tv_sec = lockedFile.getATime();
  buffer->st_mtim.tv_sec = lockedFile.getMTime();
  buffer->st_ctim.tv_sec = lockedFile.getCTime();
  return __WASI_ERRNO_SUCCESS;
}

auto buffer = (struct stat*)buf; Is this buffer defined by emsdk itself or the system library?

oatgnauh commented 11 months ago

If host is 64 bit so there that stat struct probably uses 64 bit variables while perhaps emscripten use 32 bit (or the other way around). Why the emscripten version has padding I can not say but that seems to be the case.

Perhaps do a hexdump of the entire struct to see what that looks like.

One can experiment with printing the addresses. That will tell where they are expected (although not where they actually are). printf("sb.st_mode %x %p %p\n", sb.st_mode, &sb, &sb.st_mode);

When I try it this, tells me that in one st_mode is expected at offset 0xC and at offset 0x18 in the other.

To be more precise you may need to define a guest struct for stat and copy the content data member by member:

So these members:

struct translated_stat { dev_t st_dev // Device ID of device containing file. (perhaps padding here or not) ino_t st_ino // File serial number. mode_t st_mode // Mode of file (see below).

and so on.

Copied one by one.

host.st_dev = guest.st_dev; host.st_ino = guest.st_ino; host.st_mode = guest.st_mode;

and so on.

Or copy it (memcpy style) to the translate_struct and then data appears as aligned. Or I totally missed what the question was.

I have print st_size address, you can check the image i post first. key word is sizeptr, both print inside emscripten and my code

sbc100 commented 11 months ago

When I gave a legal path to lstat

What do you mean by legal path here? Are you sure the file exists in the emscripten virtual file system? Are you aware that emscripten does not have direct access to the host filesystem. See: https://emscripten.org/docs/api_reference/Filesystem-API.html

oatgnauh commented 11 months ago

When I gave a legal path to lstat

What do you mean by legal path here? Are you sure the file exists in the emscripten virtual file system? Are you aware that emscripten does not have direct access to the host filesystem. See: https://emscripten.org/docs/api_reference/Filesystem-API.html

Yes I am sure, my application is running on browser, the path I passing is a FS file path. And I can use js interface FS to verify it's legality. I am quite sure the path I pass to lstat is a file image

sbc100 commented 11 months ago

Can you check what value is being returned by lstat? Is it returning zero?

What is VEFS in the above screenshot?

sbc100 commented 11 months ago

Oh wait I can see in your screenshot that its printing exists: false which means stat is indeed returning zero.

Going back to the original question, are are thinking that your user code is not using emscripten stat struct layout?

sbc100 commented 11 months ago

If you are somehow including /usr/include/x86_64-linux-gnu/bits/stat.h that would be real source of problem. That header should never be included in any emscripten-built program. That is a host-specific header and won't work with emscripten.