fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License

Different return types for S3Path.stat() and WindowsUPath.stat() #145

Closed: theogaraj closed this issue 4 months ago

theogaraj commented 9 months ago

Which operating system and Python version are you using? Windows 11, Python 3.9.6

Which version of this project are you using? 0.1.3

What did you do? I am attempting to use universal_pathlib in order to have a unified way of handling files whether they are local or in S3. One of the things I need to do is get all the files in a folder (yes I know S3 doesn't have actual folders, but hopefully you understand what I mean) and get their sizes.

>>> from upath import UPath
>>> lpath = UPath('local_data/files/')
>>> spath = UPath('s3://test_bucket/files/')
>>> lfiles = lpath.iterdir()
>>> lf = next(lfiles)
>>> type(lf.stat())
<class 'os.stat_result'>
>>> sfiles = spath.iterdir()
>>> sf = next(sfiles)
>>> type(sf.stat())
<class 'dict'>
>>> lf.stat().st_size
1657
>>> sf.stat().st_size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'st_size'

What did you expect to see? I expected to be able to use the same code to get file sizes whether the files are in a local directory or in S3.

What did you see instead? The return types of stat() are different for S3Path vs WindowsUPath and I can't get file size in the same way from each.

Would this difference in behavior be something you would consider reconciling? Alternatively, do you have suggestions on another approach to achieving what I'm trying to do?

ap-- commented 9 months ago

Hi @theogaraj

This is definitely an issue, and we should create a UPathStatResult class (name up for discussion) that provides an interface compatible with both dict (fsspec) and os.stat_result.

We need to ensure that the os.stat_result attribute names map to the equivalent type for each of the fsspec filesystems, too.

relevant fsspec issues:


For now I don't have a good recommendation other than

_stat = pth.stat()
size = _stat["size"] if isinstance(_stat, dict) else _stat.st_size
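The conditional above can be wrapped in a small helper so calling code stays the same for local and remote paths. This is only a minimal sketch: the name `get_size` is hypothetical and not part of universal_pathlib, and it assumes a dict result always carries the "size" key.

```python
import os


def get_size(stat_result):
    """Return the size in bytes from an fsspec info dict or an os.stat_result."""
    if isinstance(stat_result, dict):
        # fsspec-backed paths currently return the info() dict from stat()
        return stat_result["size"]
    # local paths return an os.stat_result
    return stat_result.st_size


# Usage with a local path; with UPath this would be get_size(pth.stat())
print(get_size(os.stat(os.curdir)))
```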

Cheers, Andreas

ap-- commented 9 months ago

I vaguely remembered that I implemented something related to this already...

getmtime on fsspec

https://github.com/mdshw5/pyfaidx/blob/cac82f24e9c4e334cf87a92e477b92d4615d260f/pyfaidx/__init__.py#L1318-L1345

theogaraj commented 9 months ago

@ap-- thank you for checking and responding so promptly. Because of another problem I had with iterdir (logged as https://github.com/fsspec/universal_pathlib/issues/146) I ended up going with the slightly clunkier spath.fs.ls(str(spath)) and then accessing the ['size'] key common to both local and S3.

So this is by no means a showstopper. I'll track this issue and can update my code whenever someone is able to resolve these two issues.
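For anyone following along, the `fs.ls` approach described above could look roughly like the sketch below. The function name `list_sizes` is hypothetical. It passes `detail=True` explicitly on the assumption that not every filesystem implementation returns info dicts by default.

```python
def list_sizes(fs, path):
    """Map each entry name under `path` to its size in bytes."""
    # detail=True asks for info dicts (with "name", "size", "type")
    # instead of bare names
    entries = fs.ls(path, detail=True)
    return {entry["name"]: entry["size"] for entry in entries}


# Intended usage with a UPath, following the comment above (not run here):
#   sizes = list_sizes(spath.fs, str(spath))
```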

bolkedebruin commented 7 months ago

Just ran across this. See https://github.com/apache/airflow/blob/main/airflow/io/store/stat.py for a stat compatible version.

asford commented 5 months ago

I just ran across the same issue here and would be happy to fire a PR with a fix, though it looks like @ap-- may already be investigating?

@bolkedebruin made several updates to the airflow io provider while inheriting from UPath in https://github.com/apache/airflow/pull/35612, which are great prior art here.

I would propose that we just port the changes to support stat into UPath, which is now under:

ap-- commented 4 months ago

Collected Info

For links to info() dicts of a lot of FileSystem implementations check https://github.com/fsspec/filesystem_spec/issues/526#issuecomment-1936188996

All filesystems have "name", "size" and "type".

For translating to os.stat_result.st_* attributes, these are the keys that could be checked:

Attribute | Possible Info Keys
--- | ---
st_mode | mode, unix.mode, writable, isLink, nlink, permission, isexec
st_ino | ino, name, id, sha, hex, Digest?
st_dev |
st_nlink | nlink, isLink
st_uid | uid, owner, uname, unix.owner
st_gid | gid, group, gname, unix.group
st_size | size
st_atime | time, last_accessed_on, accessTime
st_mtime | mtime, last_modified, last_modification_time_ms, timeModified, modify, modificationTime, LastModified, modified_at
st_ctime |
st_birthtime | created, creation_time, timeCreated, created_at

Additional info to be considered comes from specific filesystems:

All data types need to be normalized to int
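To make the mapping concrete, here is a rough sketch of a wrapper that exposes `st_size` and `st_mtime` on top of an info dict while keeping dict-style access. Everything in it is an assumption for illustration: the class name `InfoStatResult`, the helper `_to_epoch_seconds`, treating naive datetimes as UTC, and treating keys ending in `_ms` as milliseconds. It is not the implementation in universal_pathlib, and it covers only two of the attributes in the table.

```python
from datetime import datetime, timezone

# Candidate keys copied from the st_mtime row of the table above
_MTIME_KEYS = (
    "mtime", "last_modified", "last_modification_time_ms", "timeModified",
    "modify", "modificationTime", "LastModified", "modified_at",
)


def _to_epoch_seconds(value):
    """Convert a datetime or a number to integer seconds since the epoch."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive datetimes are interpreted as UTC
            value = value.replace(tzinfo=timezone.utc)
        return int(value.timestamp())
    return int(value)


class InfoStatResult:
    """Hypothetical wrapper giving an info dict stat-like attributes."""

    def __init__(self, info):
        self._info = info

    def __getitem__(self, key):
        # Keep dict-style access so existing code using info["size"] still works
        return self._info[key]

    @property
    def st_size(self):
        return int(self._info["size"])

    @property
    def st_mtime(self):
        for key in _MTIME_KEYS:
            if key in self._info:
                seconds = _to_epoch_seconds(self._info[key])
                # Assumption: keys ending in "_ms" hold milliseconds
                return seconds // 1000 if key.endswith("_ms") else seconds
        raise AttributeError("info dict has no known modification-time key")
```

With a wrapper along these lines, `stat().st_size` and `stat()["size"]` would both work on an fsspec-backed path. String timestamps and the remaining `st_*` attributes would need their own per-filesystem handling.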