MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0
46 stars 22 forks source link

improve handling of symlinks in poll OR re-introduce get (or equivalent functionality) for polls #1298

Open reidsunderland opened 1 week ago

reidsunderland commented 1 week ago

I'm mostly documenting that there is at least one case where get is useful, not necessarily saying we need to work on re-implementing it.

We just had a problem with a v2 SFTP poll, where we are polling a directory that contains many symlinks to files.

Previously, this directory presumably did not contain symlinks, so the poll worked fine. Sarra v2 would list the directory, the directory listing would report the actual file sizes, and everything worked.

The source made a change, so that the directory listing reports the size of the symlinks in the directory.

e.g.

ls /somedir/
100 file1
100 file2

Which is much smaller than the actual file sizes. I worked around this problem by migrating the v2 sarra to sr3 and turning on acceptSizeWrong.

If we do ls /somedir/*, the SFTP server would return the correct file sizes:

ls /somedir/*
23132 file1
8943 file2

An alternative solution would have been changing the v2 poll to use get * for each directory. But I didn't do this because I knew it wouldn't work with sr3.

I also tried path /somedir/* on sr3, but that fails because it tries to cd to that directory:

[WARNING] sarracenia.flowcb.poll cd sr_poll/cd: could not cd to directory /somedir/*
petersilva commented 1 week ago

in sr3, the structure returned from sftp is SFTPAttributes... ... we can already test if it is a link...

had a code snippet here... but it's wrong... see later comment...

We could add logic, perhaps based on option follow_symlinks where, if we find a link, we either don't return it at all, or return followed type... (by doing a stat of the destination and returning that instead...)

petersilva commented 1 week ago

might have to do that in the ls routine of the sftp driver though...

petersilva commented 1 week ago

This is where I meant:

https://github.com/MetPX/sarracenia/blob/1a0eb1645f62de0a719b877882259c586e39a136/sarracenia/transfer/sftp.py#L419-L427

we post process the SFTPAttributes records being looped there... and we can test for a symlink, and replace the record with the destination... we could do that either all the time, or conditionally on whether follow_symlink is set.

petersilva commented 1 week ago

oh... and to say v2 works with an ascii listing... where sr3 works on the SFTPAttributes... we should see an 'l' at the beginning of the line to be able to deduce that it is a link, and do a similar thing... I think it is easier/more worthwhile to do based on SFTPAttributes in sr3.

reidsunderland commented 1 week ago

At least on this server, the directory listing does show that the file is a symlink (lrwxrwxrwx) but doesn't show where the target is.

stat'ing each file is probably a good option, and we talked about implementing for HTTP #1131 polls already. But it would cause a lot more CPU usage and load on the remote server.

Maybe setting the size to 0 when we see a symlink is also a possibility?

petersilva commented 1 week ago

in sr3: the ls_attr call already present returns SFTPAttributes for every file in the directory... we don't need to stat again. In the SFTPAttributes record already returned is the same record returned by stat. There is already... the mode field which can be queried to find if it is a link... we do need to issue a readlink for each link we find if we want to find the destination... SFTP protocol has readlink...

in v2: the ls_attr is re-constructed into a string to return something that looks like what is returned by an FTP server... but the same sort of logic is there. it's just that v2 throws away the stat records and returns the string.

reidsunderland commented 1 week ago

I was imagining that stat'ing the file symlink would give the stat of the actual file. But I guess we would need to do readlink, then stat the file path that is returned by readlink

petersilva commented 1 week ago

options:

weirdness:

petersilva commented 1 week ago

currently poll is v2... these changes are too involved to do there... so likely want to test with sr3...

petersilva commented 1 week ago

back to talking about get...

if we just use path ... the what if someone does:


 path /a/*/hoho/*gw*

(ie. a path that is not within a single directory.) ... current logic uses cd to get to a directory, and then does ls (no args) within it... that logic would be broken by this... so some kind of refactor needed.

suggestion to use path (to establish baseline) and have a second argument for pattern within it. An alternate syntax...

  path a/1/hoho  *gw* 
  path a */hoho/*gw*

having the pattern as explicitly within a path (separate argument)

Another wrinkle.. the globbing pattern is a server-side grammar... need to know the server-side grammar to specify... e.g. I believe windows has a different globbing grammar.

Won't work on http... so API differences for different protocols.