diskfs / go-diskfs

MIT License
494 stars 112 forks source link

squashfs: read target of symlinks #200

Closed ncw closed 8 months ago

ncw commented 8 months ago

I'd like to be able to read the target of symlinks.

I can think of three methods to do this. My favourite is method 2 I think.

I'm happy to make a PR for whichever method we decide is best.

method 1

One way of doing this would be to add a Readlink() (string, error) method to FileStat. This would return an error for non symlinks.

https://github.com/diskfs/go-diskfs/blob/3fa702b44f0b367810bee40d609b71ebcabd7713/filesystem/squashfs/directoryentry.go#L8-L13

This could be implemented by adding a target field and setting it here

https://github.com/diskfs/go-diskfs/blob/3fa702b44f0b367810bee40d609b71ebcabd7713/filesystem/squashfs/squashfs.go#L449-L452

method 2

An alternative would be to add a new method to directoryEntry (which is passed to the caller as an os.FileInfo), say Readlink() (string, error) which you'd have to type assert to have access to. This would have the advantage that we don't need to put stuff in the FileStat the user isn't going to use.

https://github.com/diskfs/go-diskfs/blob/3fa702b44f0b367810bee40d609b71ebcabd7713/filesystem/squashfs/directoryentry.go#L49-L59

If you like this concept then we could potentially implement an OpenFile method too which would save a lot of directory traversals when reading files when it is very likely you've just listed the directory and have a FileInfo to hand.

method 3

A third method would be to implement FileSystem.Readlink to mimic os.Readlink. I like this method the least though because squashfs doesn't have an efficient way of finding a file other than iterating all the directories.

deitch commented 8 months ago

I don't love method 2, in that you need to typecast to get it to work. I get it, sometimes you need to do such things, but within a squashfs, I would think you shouldn't need to do so.

Method 1 is pretty good; the amount of information you are carrying around is tiny. I might be worried if we were doing this on real-time systems, but who runs go on microcontrollers?

Method 3 is most like os.Readlink(), and so remains consistent with the go way of doing things (even if sometimes that go way is less efficient).

an efficient way of finding a file other than iterating all the directories

And what happens when you try to get the FileInfo for method 1 or directory entry for method 2? You still have to walk the directories. The issue is that now you have to do it twice: once to get to the point that you know it is a symlink, and a second time to read the symlink. That part does bother me. I suspect go did it that way because not everything has symlinks, so they wanted it out of band.

I think method 1 is your best bet. You already are there, you walked the tree, and the amount of additional information is tiny. I could be convinced otherwise, though.

ncw commented 8 months ago

I don't love method 2, in that you need to typecast to get it to work. I get it, sometimes you need to do such things, but within a squashfs, I would think you shouldn't need to do so.

I'll just note that you already need a typecast to turn the result of the Sys() call into the FileStat.

Method 1 is pretty good; the amount of information you are carrying around is tiny. I might be worried if we were doing this on real-time systems, but who runs go on microcontrollers?

Method 3 is most like os.Readlink(), and so remains consistent with the go way of doing things (even if sometimes that go way is less efficient).

Just an idea...The FileState could just contain a pointer to the directoryEntry...

In fact the FileStat could be an alias of directoryEntry (you can even make it backwards compatible with type aliases) and then we'd just add new methods to directoryEntry and they would appear on the FileStat also.

So

type FileStat = directoryEntry

Then move the Uid Gid and Xattrs methods onto dirEntry. This then makes adding Readlink quite natural and implements methods 1 and 2!

an efficient way of finding a file other than iterating all the directories

And what happens when you try to get the FileInfo for method 1 or directory entry for method 2? You still have to walk the directories. The issue is that now you have to do it twice: once to get to the point that you know it is a symlink, and a second time to read the symlink. That part does bother me. I suspect go did it that way because not everything has symlinks, so they wanted it out of band.

I think they were just following the unix syscalls which is readlink.

This is quite annoying though in general. From having implemented this many times in rclone I think the best interface is something like this:

// Node represents either a directory (*Dir) or a file (*File)
type Node interface {
    os.FileInfo
    IsFile() bool
    Inode() uint64
    SetModTime(modTime time.Time) error
    Sync() error
    Remove() error
    RemoveAll() error
    DirEntry() fs.DirEntry
    VFS() *VFS
    Open(flags int) (Handle, error)
    Truncate(size int64) error
    Path() string
    SetSys(interface{})
}

(With the addition of ReadLink!)

https://github.com/rclone/rclone/blob/8503282a5adffc992e1834eed2cd8aeca57c01dd/vfs/vfs.go#L51C1-L66C2

deitch commented 8 months ago

From having implemented this many times in rclone

Oh rclone is really nice. I didn't know that was you. I have some other projects that are in a similar space, but not really competitive. We should have a real-world discussion. Send me a Twitter DM @avideitcher or just @ me on one and we can then connect via email. If you are interested, of course.

Do you remember the seminal "rsync for backup" paper, by Mike Rubel, all those years ago? So much can be done with the basic structures at hand.

In fact the FileStat could be an alias of directoryEntry (you can even make it backwards compatible with type aliases) and then we'd just add new methods to directoryEntry and they would appear on the FileStat also.

Ooh, this is interesting. As long as it implements the right interface, why not? I rather like that. Does it work?

Since we already are talking about these, this was part of the work I did for a company early this year.

ncw commented 8 months ago

Thanks for merging this. I'll send the next part shortly which implements Open - this helps a lot in the performance stakes.

Oh rclone is really nice. I didn't know that was you. I have some other projects that are in a similar space, but not really competitive. We should have a real-world discussion. Send me a Twitter DM @avideitcher or just @ me on one and we can then connect via email. If you are interested, of course.

You can find me at nick@craig-wood.com also @njcw on twitter (I have followed you!). Yes I agree we have lots of interests in common and a discussion would be great!

Do you remember the seminal "rsync for backup" paper, by Mike Rubel, all those years ago? So much can be done with the basic structures at hand.

I don't think I saw that, but I'm a big fan of rsync! The text UI of rclone is modeled around rsync and the philosophy is very similar. The only bit rclone doesn't implement is the very clever delta encoding transfer, but there aren't many cloud providers which do that!

Since we already are talking about these, this was part of the work I did for a company early this year.

Oh nice!

I'm currently working on an archive backend for rclone which enables reading of archives like .zip and .squashfs - that is why I've been so interested in squashfs.

You can see the code so far here

v1.66.0-beta.7581.1cee1e101.archive-backend on branch archive-backend

To tell rclone you want to look at an archive you do rclone ls :archive:/path/to/archive.sqfs

deitch commented 8 months ago

To tell rclone you want to look at an archive you do rclone ls :archive:/path/to/archive.sqfs

Sometimes I think 50% of the things I get involved with in one way or another are, "how do we encode something that is kind of like a URL in a single string?" 😆

Will email you.