cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Consider using fsspec for file access #2145

Open kosack opened 1 year ago

kosack commented 1 year ago

Please describe the use case that requires this feature. Since astropy now optionally uses the fsspec library for opening local and remote FITS files, and this library is also used by pandas and many others, it might be useful to replace parts of ctapipe's URL functionality (ctapipe.utils.download and ctapipe.utils.download_cached) with it.

It supports opening files from many filesystem sources (see list below) and many compression methods, and provides the caching we currently implement ourselves.

Additional context

e.g.:

import fsspec

of = fsspec.open("github://cta-observatory:ctapipe@master/environment.yml")
with of as f:
    print(f.read())

Available support:

In [4]: fsspec.available_compressions()
Out[4]: [None, 'zip', 'bz2', 'gzip', 'lzma', 'xz', 'lz4', 'zstd']
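As an illustration, the gzip support can be exercised entirely locally. This is a sketch; the file path and contents are made up:

```python
import gzip
import os
import tempfile

import fsspec

# create a small gzipped file to read back (path is arbitrary)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "events.txt.gz")
with gzip.open(path, "wt") as f:
    f.write("event data")

# fsspec decompresses transparently; compression="infer" would also
# deduce gzip from the .gz extension
with fsspec.open(path, "rt", compression="gzip") as f:
    content = f.read()

print(content)
```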

In [5]: fsspec.available_protocols()
Out[5]:
['file',
 'memory',
 'dropbox',
 'http',
 'https',
 'zip',
 'tar',
 'gcs',
 'gs',
 'gdrive',
 'sftp',
 'ssh',
 'ftp',
 'hdfs',
 'arrow_hdfs',
 'webhdfs',
 's3',
 's3a',
 'wandb',
 'oci',
 'asynclocal',
 'adl',
 'abfs',
 'az',
 'cached',
 'blockcache',
 'filecache',
 'simplecache',
 'dask',
 'dbfs',
 'github',
 'git',
 'smb',
 'jupyter',
 'jlab',
 'libarchive',
 'reference',
 'generic',
 'oss',
 'webdav',
 'dvc',
 'root']
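The caching protocols in that list map onto what ctapipe.utils.download_cached does today. A minimal local sketch (the paths are temporary stand-ins; a real use would chain `simplecache::` in front of an `https://` URL):

```python
import os
import tempfile

import fsspec

# a local "remote" file to fetch (stand-in for an HTTP source)
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "input.dat")
with open(src, "w") as f:
    f.write("hello ctapipe")

# "simplecache::" chains a local cache in front of any protocol,
# roughly what download_cached does for HTTP today; per-protocol
# options are passed as a dict named after the protocol
cache_dir = os.path.join(tmpdir, "cache")
with fsspec.open(f"simplecache::file://{src}",
                 simplecache={"cache_storage": cache_dir}) as f:
    data = f.read()

print(data)  # bytes, since fsspec.open defaults to mode "rb"
```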
kosack commented 1 year ago

The downside, of course, would be one more dependency, which is maybe not really needed (file access may be via local files only, for the most part).

HealthyPear commented 1 year ago

Regarding this, a possible use case is that of magic-cta-pipe, where data access happens via SSH tunnels to a container at La Palma.

As far as I know, the Python 3 standard library doesn't provide an API for dealing with SSH. Maybe one could use subprocess, but that could present safety issues.

An alternative seems to be Paramiko. In any case, I think supporting something other than HTTP will add a new dependency (but we could define it as an extra...).

By the way, I think I can safely assume that the current API in utils doesn't support SSH tunneling, am I right?

@Elisa-Visentin

maxnoe commented 1 year ago

SSH is for shells. It does not deal with files. There are tools built on top of ssh to transfer files (e.g. rsync, scp, sftp) or to mount a remote server's files locally (sshfs).

Tunneling is yet a different concept (exposing ports / jumping multiple hosts).

I don't see how that directly relates to input to ctapipe. Could you offer some clarification?

HealthyPear commented 1 year ago

I don't see how that directly relates to input to ctapipe. Could you offer some clarification?

Yes, this issue is probably about something different. Seeing "ssh" among the available protocols, I thought it might also cover the actual connection; am I wrong?

As a "corollary", I wanted to understand whether ctapipe.utils plans to support SSH tunneling, e.g. to get test data from a server. I have the feeling it works only via HTTPS.

kosack commented 1 year ago

@HealthyPear you can support multiple SSH jumps transparently just by editing your .ssh/config and adding an appropriate entry (e.g. ProxyJump). For example, see https://www.redhat.com/sysadmin/ssh-proxy-bastion-proxyjump. After setting that up, you don't need to support that in software: you can scp from that machine as if it weren't going through an intermediate host.
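A sketch of such a config entry (all hostnames here are placeholders):

```
Host worker
    HostName worker.internal.example
    ProxyJump gateway.example.org
```

With this in place, `scp worker:/path/to/file .` goes through the gateway transparently.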

Or if you use a real tunnel, i.e. remapping a remote port to a local one, then we would have to support the port part of the URL (which I think we currently do not), but something like fsspec does support that.

maxnoe commented 1 year ago

(which we currently do not I think)

We just pass the URL to requests, so it will support non-standard ports
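Indeed, an explicit port is just part of the URL's netloc and needs no special handling. A stdlib sketch with a made-up tunneled endpoint:

```python
from urllib.parse import urlparse

# a tunneled endpoint remapped to a local port (hypothetical URL)
url = "https://localhost:8443/testdata/gamma_test.simtel.gz"
parts = urlparse(url)

# requests receives the same URL string and connects to that port
print(parts.hostname, parts.port)
```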

maxnoe commented 3 months ago

I think we probably should do this, as it will also enable us to avoid copying files to worker nodes on the grid if we can read directly from protocols like root:// or others supported by the storage elements / DIRAC.

DIRAC has the option to either download files for the job and put them into the current directory, or just provide a URL.