Open martindurant opened 3 years ago
Marked as "good first issue" because this should be simple per implementation, but there are quite a few implementations to go through.
I collected some information about the .info() dicts of the different filesystems.
Posting it here in case it might be useful:
"name", "size", "type"
"name", "size", "type", "mtime" (datetime | float | None)
https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileInfo.html#pyarrow.fs.FileInfo
returns whatever the remote fs returns.
"name", "size", "type", "mimetype"
"name", "size", "type"
returns whatever the remote fs returns.
"name", "size", "type", "modify", "unix.owner", "unix.group", "unix.mode", and other returned via FTP.mlsd()
"name", "size", "type", "hex", "mode" # mode is octal str, hex is str?
"name", "size", "type", "sha", "mode" # mode is octal str, sha is str
"name", "size", "type", "mimetype", "ETag", "Content-MD5", "Digest"
"name", "size", "type", "last_modified", "created", "format", "mimetype", "writable"
example:
{
"name": "slurm-22382538.out",
"last_modified": "2024-02-09T13:03:30.773865Z",
"created": "2024-02-09T13:03:30.773865Z",
"format": null,
"mimetype": null,
"size": 2896,
"writable": true,
"type": "file"
}
"name", "size", "type", "created", "mode", "uid", "gid", "mtime"
libarchive mappings:
"name", "size", "type", "created", "isLink", "mode", "uid", "gid", "mtime", "ino", "nlink", "destination"
"name", "size", "type", "created"
"name", "size", "type"
"name", "size", "type", "uid", "gid", "time", "mtime"
"name", "size", "type", "uid", "gid", "time", "mtime"
"name", "size", "type", "mode", "uid", "gid", "mtime", "chksum", "linkname", "uname", "gname", "devmajor", "devminor"
example:
_ = {
'name': 'somefile.md',
'mode': 420,
'uid': 501,
'gid': 20,
'size': 382,
'mtime': 1707314187,
'chksum': 8314,
'type': 'file',
'linkname': '',
'uname': 'andreaspoehlmann',
'gname': 'staff',
'devmajor': 0,
'devminor': 0
}
"name", "size", "type", "accessTime", "blockSize", "group", "modificationTime", "owner", "pathSuffix", "permission", "replication"
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#FileStatus
"name", "size", "type"
"name", "size", "type", "metadata", "creation_time", "deleted", "deleted_time", "last_modified", "content_time", "content_settings", "remaining_retention_days", "archive_status", "last_accessed_on", "etag", "tags", "tag_count", "version_id", "is_current_version"
https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L49-L67
https://cloud.google.com/storage/docs/json_api/v1/objects#resource
"name", "size", "type", "StorageClass", "VersionId", "ContentType", "ETag", "LastModified"
"name", "size", "type", "last_modification_time_ms"
"name", "size", "type", "md5", "mimetype"
"name", "size", "type", "etag", "md5", "timeCreated", "timeModified", "storageTier", "archivalState"
same as local
"name", "size", "type", and other returned via ??? https://developers.google.com/drive/api/reference/rest/v3/files#File
"name", "size", "type", and all public attr from FileMetadata
https://dropbox-sdk-python.readthedocs.io/en/latest/api/files.html#dropbox.files.FileMetadata
"name", "size", "type", "LastModified"
"name", "size", "type" and others returned via
_ = {
'name': '/',
'href': '/',
'size': None,
'created': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=tzutc()),
'modified': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=datetime.timezone.utc),
'content_language': None,
'content_type': None,
'etag': None,
'type': 'directory',
'display_name': 'test_storage_options0'
}
"name", "size", "type", "md5", "md5-dos2unix", "dvc_info", "isdvc", "isout", "fs_info", "isexec", "repo"
https://github.com/iterative/dvc/blob/953ae56536f03d915f396cd6cafd89aaa54fafc5/dvc/fs/dvc.py#L41-L69
"name", "size", "type"
"name", "size", "type", "id", "modified_at", "created_at"
"name", "size", "type", "content-type", "checksum", "mtime"
Thank you, @ap--, that is very useful. Also worth adding that some backends that don't really have directories will make fake info dicts for those directories, typically with {"name": "...", "size": 0, "type": "directory"}.
Your list makes it sound like any FS could do with an add_standard_info_fields(info_dict) static method, where we decide what those standard fields are. For example, it could convert whatever time unit the backend returns to a standard representation, which would help for rsync() in particular.
Yes, that would be a great step towards standardizing the info_dict.
AbstractFileSystem could even have a default implementation that tries various aliases for getting mtime (and potentially other fields), as well as conversions to the standard datatype (i.e. like this ).
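A minimal sketch of what such a default could look like, not an existing fsspec API. The alias tuple `MTIME_ALIASES` and the helper `_to_timestamp` are hypothetical names, the aliases are taken from the keys listed earlier in this thread, and float seconds since the epoch is only one possible choice of standard datatype. Keys whose unit is not seconds (for example "last_modification_time_ms") are deliberately left out and would need per-backend handling.

```python
from datetime import datetime, timezone

# Hypothetical alias list, taken from keys reported in this thread.
# Millisecond-based keys are excluded because they need unit conversion.
MTIME_ALIASES = (
    "mtime",
    "modified",
    "last_modified",
    "LastModified",
    "timeModified",
    "modified_at",
)


def _to_timestamp(value):
    """Best-effort conversion to float seconds since the epoch, else None."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive datetimes are treated as UTC.
            value = value.replace(tzinfo=timezone.utc)
        return value.timestamp()
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            # Accept ISO 8601 strings such as "2024-02-09T13:03:30.773865Z".
            return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
        except ValueError:
            return None
    return None


def add_standard_info_fields(info):
    """Return a copy of info with "mtime" normalized, if any alias is usable."""
    out = dict(info)
    for key in MTIME_ALIASES:
        ts = _to_timestamp(info.get(key))
        if ts is not None:
            out["mtime"] = ts
            break
    return out
```

If no alias yields a usable value, the dict is returned without an "mtime" key, so callers can still tell "unknown" apart from a real timestamp.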
For completeness, I'm cross-referencing barneygale/pathlib-abc#3. I started looking into this because I need to convert info_dicts into an os.stat_result compatible type for universal_pathlib.
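For illustration only, here is one way an info dict could be mapped onto a real `os.stat_result`; this is not how universal_pathlib does it. The function name `info_to_stat_result` is hypothetical, and the sketch assumes "mode" is an int of permission bits and "mtime" is a number of seconds, as in the tar example above. Fields the info dict does not carry are filled with placeholder values.

```python
import os
import stat


def info_to_stat_result(info):
    """Build an os.stat_result from an fsspec-style info dict (sketch)."""
    # Assumption: anything that is not a "directory" is treated as a regular file.
    kind = stat.S_IFDIR if info.get("type") == "directory" else stat.S_IFREG
    # Assumption: "mode", when present as an int, holds permission bits.
    mode = info.get("mode")
    perms = stat.S_IMODE(mode) if isinstance(mode, int) else 0
    # Assumption: "mtime", when numeric, is seconds since the epoch.
    mtime = info.get("mtime")
    mtime = int(mtime) if isinstance(mtime, (int, float)) else 0
    return os.stat_result((
        kind | perms,            # st_mode
        0,                       # st_ino (placeholder)
        0,                       # st_dev (placeholder)
        1,                       # st_nlink (placeholder)
        info.get("uid") or 0,    # st_uid
        info.get("gid") or 0,    # st_gid
        info.get("size") or 0,   # st_size (None becomes 0)
        0,                       # st_atime (placeholder)
        mtime,                   # st_mtime
        0,                       # st_ctime (placeholder)
    ))
```

Using 0 for missing values is a simplification; a real implementation would need a way to signal that a field is unknown.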
While you're at it, nanoseconds instead of float times would be good: https://docs.python.org/3/library/os.html#os.stat_result.st_mtime_ns
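A rough sketch of such a conversion, assuming the input is either a datetime or a number of seconds since the epoch. `mtime_to_ns` is a hypothetical helper name. Integer arithmetic is used for datetimes so that microseconds are not lost to float rounding.

```python
from datetime import datetime, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def mtime_to_ns(mtime):
    """Convert a datetime or a number of seconds to integer nanoseconds."""
    if isinstance(mtime, datetime):
        if mtime.tzinfo is None:
            # Assumption: naive datetimes are treated as UTC.
            mtime = mtime.replace(tzinfo=timezone.utc)
        delta = mtime - _EPOCH
        seconds = delta.days * 86400 + delta.seconds
        return seconds * 10**9 + delta.microseconds * 1000
    if isinstance(mtime, int):
        return mtime * 10**9
    # Floats: precision is limited by the float representation itself.
    return round(mtime * 1e9)
```

For example, the integer `mtime` from the tar example above maps directly to `1707314187 * 10**9`.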
Created and/or modified time is returned in the file info of most backends. We should endeavour to surface these in the file info dict with a common format (datetime.datetime? unix timestamp?) and key names, e.g.,