fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Add ctime/mtime to list of expected values in info #526

Open martindurant opened 3 years ago

martindurant commented 3 years ago

Created and/or modified time is returned in the file info of most backends. We should endeavour to surface these in the file info dict with a common format (datetime.datetime? unix timestamp?) and key names.

e.g.,

--- a/fsspec/implementations/local.py
+++ b/fsspec/implementations/local.py
@@ -78,6 +78,8 @@ class LocalFileSystem(AbstractFileSystem):
                 result["size"] = out2.st_size
             except IOError:
                 result["size"] = 0
+        result['created'] = datetime.datetime.utcfromtimestamp(result["created"])
+        result['modified'] = datetime.datetime.utcfromtimestamp(result["mtime"])
         return result

martindurant commented 3 years ago

Marked as "good first issue" because this should be simple per implementation, but there are quite a few implementations to go through.
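For anyone picking this up, here is a minimal sketch of what a per-implementation change could look like for a local path. It is not the fsspec implementation: `local_info_with_times` is a hypothetical helper, the key names "created" and "modified" are taken from the diff above, and timezone-aware UTC datetimes are only one of the candidate formats under discussion. Note that on Unix `st_ctime` is the metadata-change time rather than a true creation time.

```python
import datetime
import os

UTC = datetime.timezone.utc


def local_info_with_times(path):
    """Hypothetical helper: a minimal info dict plus UTC-aware timestamps."""
    st = os.stat(path)
    return {
        "name": path,
        "size": st.st_size,
        "type": "directory" if os.path.isdir(path) else "file",
        # Key names follow the diff in this issue; the final names are undecided.
        # On Unix, st_ctime is the last metadata change, not the creation time.
        "created": datetime.datetime.fromtimestamp(st.st_ctime, tz=UTC),
        "modified": datetime.datetime.fromtimestamp(st.st_mtime, tz=UTC),
    }


info = local_info_with_times(".")
print(info["type"], info["modified"].isoformat())
```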

ap-- commented 8 months ago

A list of filesystems and their info keys

I collected some notes about the .info() dicts of the different filesystems. Posting them here in case they are useful:

AbstractFileSystem

"name", "size", "type"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/spec.py#L669-L670

arrow

"name", "size", "type", "mtime" (datetime | float | None)

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/arrow.py#L101-L118

https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileInfo.html#pyarrow.fs.FileInfo

dask

returns whatever the remote fs returns.

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/dask.py#L93-L97

data

"name", "size", "type", "mimetype"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/data.py#L31-L35

dbfs

"name", "size", "type"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/dbfs.py#L84-L90

dirfs

returns whatever the remote fs returns.

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/dirfs.py#L233-L241

ftp

"name", "size", "type", "modify", "unix.owner", "unix.group", "unix.mode", and others returned via FTP.mlsd()

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/ftp.py#L100-L118

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/ftp.py#L370-L384

git

"name", "size", "type", "hex", "mode" # mode is octal str, hex is str?

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/git.py#L90-L96

github

"name", "size", "type", "sha", "mode" # mode is octal str, sha is str

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/github.py#L167-L178

http

"name", "size", "type", "mimetype", "ETag", "Content-MD5", "Digest"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/http.py#L190-L194

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/http.py#L838-L856

jupyter

"name", "size", "type", "last_modified", "created", "format", "mimetype", "writable"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/jupyter.py#L47-L57

example:

{
    "name": "slurm-22382538.out",
    "last_modified": "2024-02-09T13:03:30.773865Z",
    "created": "2024-02-09T13:03:30.773865Z",
    "format": null,
    "mimetype": null,
    "size": 2896,
    "writable": true,
    "type": "file"
}

libarchive

"name", "size", "type", "created", "mode", "uid", "gid", "mtime"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/libarchive.py#L165-L172

libarchive mappings:

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/libarchive.py#L145-L153

local

"name", "size", "type", "created", "isLink", "mode", "uid", "gid", "mtime", "ino", "nlink", "destination"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/local.py#L97-L112

memory

"name", "size", "type", "created"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/memory.py#L41-L47

reference

"name", "size", "type"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/reference.py#L224-L235

sftp

"name", "size", "type", "uid", "gid", "time", "mtime"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/sftp.py#L108-L120

smb

"name", "size", "type", "uid", "gid", "time", "mtime"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/smb.py#L168-L176

tar

"name", "size", "type", "mode", "uid", "gid", "mtime", "chksum", "linkname", "uname", "gname", "devmajor", "devminor"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/tar.py#L112-L116

example:

_ = {
    'name': 'somefile.md',
    'mode': 420,
    'uid': 501,
    'gid': 20,
    'size': 382,
    'mtime': 1707314187,
    'chksum': 8314,
    'type': 'file',
    'linkname': '',
    'uname': 'andreaspoehlmann',
    'gname': 'staff',
    'devmajor': 0,
    'devminor': 0
}

webhdfs

"name", "size", "type", "accessTime", "blockSize", "group", "modificationTime", "owner", "pathSuffix", "permission", "replication"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/webhdfs.py#L266-L270

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#FileStatus

zip

"name", "size", "type"

https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/zip.py#L100-L104

adlfs

"name", "size", "type", "metadata", "creation_time", "deleted", "deleted_time", "last_modified", "content_time", "content_settings", "remaining_retention_days", "archive_status", "last_accessed_on", "etag", "tags", "tag_count", "version_id", "is_current_version"

https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L49-L67

https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L829C13-L846

gcsfs

https://cloud.google.com/storage/docs/json_api/v1/objects#resource

https://github.com/fsspec/gcsfs/blob/f526d96860c1422e7b4599b70b267607dae1af8a/gcsfs/core.py#L465-L477

s3fs

"name", "size", "type", "StorageClass", "VersionId", "ContentType", "ETag", "LastModified"

https://github.com/fsspec/s3fs/blob/74f4d95a62d7339a1af12db4339f22c5f3d73670/s3fs/core.py#L1310-L1319

alluxio

"name", "size", "type", "last_modification_time_ms"

https://github.com/fsspec/alluxiofs/blob/33489bcea618d6e934e5227be77be75b5ca105ff/alluxiofs/core.py#L134-L149

wandb

"name", "size", "type", "md5", "mimetype"

https://github.com/jkulhanek/wandbfs/blob/ccc7e4dceb45070de8c440b44ddee96fdd348057/wandbfs/_wandbfs.py#L63-L68

oci

"name", "size", "type", "etag", "md5", "timeCreated", "timeModified", "storageTier", "archivalState"

https://github.com/oracle/ocifs/blob/f0e1d3b7b26bc1c1b010abb11df6cd06ac318ed3/ocifs/core.py#L498-L509

asynclocal

same as local

gdrive

"name", "size", "type", and others returned via ??? https://developers.google.com/drive/api/reference/rest/v3/files#File

https://github.com/fsspec/gdrivefs/blob/8bbfa457605d60d40d2b09c8c93d493cf543100e/gdrivefs/core.py#L157-L160

dropbox

"name", "size", "type", and all public attr from FileMetadata

https://dropbox-sdk-python.readthedocs.io/en/latest/api/files.html#dropbox.files.FileMetadata

https://github.com/fsspec/dropboxdrivefs/blob/23463258eca49c10d77de33e9d07e4ee5caa090c/dropboxdrivefs/core.py#L163-L176

oss

"name", "size", "type", "LastModified"

https://github.com/fsspec/ossfs/blob/016ccbad6b90fe02cf613582bb8db3bb101f4438/src/ossfs/base.py#L186-L199

webdav

"name", "size", "type" and others returned via

_ = {
    'name': '/',
    'href': '/',
    'size': None,
    'created': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=tzutc()),
    'modified': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=datetime.timezone.utc),
    'content_language': None,
    'content_type': None,
    'etag': None,
    'type': 'directory',
    'display_name': 'test_storage_options0'
}

https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/fsspec.py#L51-L57

https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/client.py#L54-L65

dvc

"name", "size", "type", "md5", "md5-dos2unix", "dvc_info", "isdvc", "isout", "fs_info", "isexec", "repo"

https://github.com/iterative/dvc/blob/953ae56536f03d915f396cd6cafd89aaa54fafc5/dvc/fs/dvc.py#L41-L69

root

"name", "size", "type"

https://github.com/CoffeaTeam/fsspec-xrootd/blob/f8c57cd7b0361425ee08a77096dd642ddeb1d987/src/fsspec_xrootd/xrootd.py#L320-L338

box

"name", "size", "type", "id", "modified_at", "created_at"

https://github.com/IBM/boxfs/blob/718fb0071d20a7004f44fe2fa0eac26dc9c3d5d5/src/boxfs/boxfs.py#L395-L402

lakefs

"name", "size", "type", "content-type", "checksum", "mtime"

https://github.com/aai-institute/lakefs-spec/blob/f05c5b6c57547e9f169e3b9c4ed5346f2d65bf35/src/lakefs_spec/spec.py#L356-L363

martindurant commented 8 months ago

Thank you, @ap--, that is very useful. Also worth adding that some backends that don't really have directories will make fake info dicts for those directories, typically with {"name": "...", "size": 0, "type": "directory"}.

Your list makes it sound like any FS could do with an add_standard_info_fields(info_dict) static method, where we decide what those standard fields are. For example, converting whatever time unit is expected to a standard representation, which would help for rsync() in particular.
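A rough sketch of how such a helper might look, written as a plain function rather than a static method. Only the name `add_standard_info_fields` comes from the comment above. The `mtime_key` parameter, the output key "modified", the helper `to_utc_datetime`, and the choice of timezone-aware UTC datetimes as the standard representation are all assumptions made for illustration.

```python
import datetime

UTC = datetime.timezone.utc


def to_utc_datetime(value):
    """Convert a backend time value to a timezone-aware UTC datetime, or None."""
    if value is None:
        return None
    if isinstance(value, datetime.datetime):
        # Assumption: naive datetimes are already expressed in UTC.
        if value.tzinfo is None:
            return value.replace(tzinfo=UTC)
        return value.astimezone(UTC)
    if isinstance(value, (int, float)):
        # Assumption: numbers are seconds since the Unix epoch.
        return datetime.datetime.fromtimestamp(value, tz=UTC)
    if isinstance(value, str):
        # ISO 8601 strings such as the jupyter example; "Z" is rewritten
        # so that older Python versions can parse it.
        parsed = datetime.datetime.fromisoformat(value.replace("Z", "+00:00"))
        return to_utc_datetime(parsed)
    raise TypeError(f"unsupported time value: {value!r}")


def add_standard_info_fields(info, mtime_key="mtime"):
    """Return a copy of info with a hypothetical standard "modified" field."""
    out = dict(info)
    out["modified"] = to_utc_datetime(info.get(mtime_key))
    return out


tar_info = {"name": "somefile.md", "size": 382, "type": "file", "mtime": 1707314187}
print(add_standard_info_fields(tar_info)["modified"])
```

Each backend would pass the key it actually uses, for example `mtime_key="last_modified"` for the jupyter info dict listed earlier.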

ap-- commented 8 months ago

Yes that would be a great step towards standardizing the info_dict.

AbstractFileSystem could even have a default implementation that tries various aliases for getting mtime (and potentially others), as well as conversions to the standard datatype (i.e. like this).
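To make the alias idea concrete, here is a small sketch of a lookup that walks a table of known key names. The function `standard_mtime` and the table `MTIME_ALIASES` are hypothetical. The key names are taken from the list above, but the units attached to them (seconds versus milliseconds) are assumptions that would need checking against each backend.

```python
import datetime

UTC = datetime.timezone.utc

# Hypothetical alias table: info key -> divisor that turns a numeric value
# into seconds since the epoch. The units are assumptions, not verified facts.
MTIME_ALIASES = {
    "mtime": 1,                         # e.g. local, tar
    "modified": 1,                      # e.g. webdav (already a datetime)
    "LastModified": 1,                  # e.g. s3fs, oss
    "modificationTime": 1000,           # webhdfs, assumed milliseconds
    "last_modification_time_ms": 1000,  # alluxio, assumed milliseconds
}


def _to_datetime(value, divisor=1):
    """Coerce a datetime or a numeric timestamp to an aware UTC datetime."""
    if isinstance(value, datetime.datetime):
        if value.tzinfo is None:
            return value.replace(tzinfo=UTC)  # assume naive means UTC
        return value.astimezone(UTC)
    return datetime.datetime.fromtimestamp(value / divisor, tz=UTC)


def standard_mtime(info):
    """Return the first modification time found under a known alias, or None."""
    for key, divisor in MTIME_ALIASES.items():
        value = info.get(key)
        if value is not None:
            return _to_datetime(value, divisor)
    return None


print(standard_mtime({"name": "f", "size": 1, "type": "file", "modificationTime": 1707314187000}))
```

A table keeps the per-backend quirks in one place, so a subclass could override or extend it instead of reimplementing the lookup.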

For completeness I'm cross-referencing barneygale/pathlib-abc#3. I started looking into this, because I need to convert info_dicts into an os.stat_result compatible type for universal_pathlib.
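As a sketch of the kind of conversion meant here (not the universal_pathlib code), the following maps an info dict onto an object with a few `os.stat_result`-style attributes. The class `InfoStatResult` is hypothetical, it covers only three fields, and it assumes "mode" is an integer of permission bits and "mtime" is numeric seconds, as in the tar example above.

```python
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoStatResult:
    """Hypothetical stand-in exposing a few os.stat_result attribute names."""

    st_mode: int
    st_size: int
    st_mtime: float

    @classmethod
    def from_info(cls, info):
        # Encode the fsspec "type" in the file-type bits of st_mode.
        file_type = stat.S_IFDIR if info["type"] == "directory" else stat.S_IFREG
        # Assumption: "mode", when present as an int, holds permission bits.
        mode = info.get("mode", 0)
        perms = stat.S_IMODE(mode) if isinstance(mode, int) else 0
        return cls(
            st_mode=file_type | perms,
            st_size=info.get("size") or 0,
            # Assumption: "mtime" is numeric seconds since the epoch.
            st_mtime=float(info.get("mtime") or 0),
        )


st = InfoStatResult.from_info(
    {"name": "somefile.md", "mode": 420, "size": 382, "mtime": 1707314187, "type": "file"}
)
print(stat.filemode(st.st_mode), st.st_size)
```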

dholth commented 7 months ago

While you're at it, nanosecond timestamps instead of float times would be good. https://docs.python.org/3/library/os.html#os.stat_result.st_mtime_ns
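One way to get integer nanoseconds without going through a float is sketched below. `datetime_to_ns` is a hypothetical helper and assumes the input is a timezone-aware datetime. Since `datetime` only carries microsecond precision, the last three digits are always zero unless the backend supplies nanoseconds directly.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def datetime_to_ns(dt):
    """Convert an aware datetime to integer nanoseconds since the Unix epoch.

    Uses integer arithmetic only, so no precision is lost to float rounding.
    Subtracting EPOCH raises TypeError if dt is a naive datetime.
    """
    delta = dt - EPOCH
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000


stamp = datetime.datetime(2024, 2, 9, 13, 3, 30, 773865, tzinfo=datetime.timezone.utc)
print(datetime_to_ns(stamp))
```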