MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

do transfer (download and send) via memory... without saving to local file... #590

Open petersilva opened 1 year ago

petersilva commented 1 year ago

I think @reidsunderland was talking about this. Someone has a case where they want to transfer from an upstream source to a downstream destination. They don't want to write the file locally, because it's essentially just a buffer for the transfer.

I was talking today with a client, and they want this for S3 uploading... that is, download the file into a buffer, and then upload each buffer to the destination.

I think this could be done with a subscribe, by sub-classing the sarracenia.transfer class to override the on_data entry point, writing the data to a pre-established output file descriptor.

along with a flowcb to pre-establish that output file descriptor (the connection to the destination).

That way, the entire transfer uses only one memory buffer.

anyways, that's one guess at how to do it... it might be wrong.
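
A minimal sketch of that guess (the class name MemoryRelay, the attribute upload_fd, and the exact on_data signature are all placeholders, not existing sarracenia API; only the on_data entry point itself comes from the idea above):

    import sarracenia.transfer

    class MemoryRelay(sarracenia.transfer.Transfer):
        """
        Hypothetical subclass: instead of letting downloaded data land
        in a local file, hand each buffer to an upload file descriptor
        (self.upload_fd) that a flowcb established before the transfer.
        """

        def on_data(self, chunk):
            # relay the buffer straight to the destination instead of disk
            self.upload_fd.write(chunk)
            return chunk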

petersilva commented 1 year ago

note that for most of our purposes, we want to save locally, because there are 10 destinations for each source product; going via memory would mean retrieving from the source 10 times. So it's not great for many cases, but in cases where we truly deliver to one location, and where it is OK to bother upstream when the downstream is having an issue, this model is helpful.

ymoisan commented 1 year ago

Mounting S3 as a file system in a local directory, as per https://cloud.netapp.com/blog/amazon-s3-as-a-file-system, allows that local mount point to act as a relay: no files are written locally if sr_subscribe is set up to write there; files are shipped directly to S3. The issue with that is the mounted directory is available to anyone with access to it, e.g. a cp there will upload a file to S3 as well.

I think it would make sense to use something like fsspec (a specification for pythonic filesystems) in the sarracenia code base. s3fs is based on both fsspec and boto3. fsspec also allows access to cloud object storage other than S3, all in a package that is apparently well maintained.
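
To make the idea concrete, a minimal sketch of what writing through fsspec/s3fs looks like (the bucket, key, and payload are placeholders; the s3fs package must be installed for the s3:// protocol to resolve):

    import fsspec

    payload = b"...product bytes already held in memory..."

    # fsspec dispatches on the URL scheme; "s3" is provided by s3fs,
    # which drives boto3 underneath.
    with fsspec.open("s3://some-bucket/incoming/product.grib2", "wb") as f:
        f.write(payload)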

petersilva commented 1 year ago

Yeah... that's a good usage for most purposes. People can just use an s3 file system today, and if they are happy with that, it's fine.

For generalizing file system access, fsspec looks interesting feature-wise. I don't know if you have been following along with work on the supercomputer, but dependencies have been a huge complication in installation. To start checking things out, I tried installing it on my Ubuntu 22.04 PC:

fractal% sudo apt install python3-fsspec
[sudo] password for peter: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libflashrom1 libftdi1-2
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  fonts-lyx libboost-dev libboost1.74-dev libclang-cpp11 libjs-jquery-ui liblbfgsb0 libllvm11 libopenblas-dev libopenblas-pthread-dev libtbb12 libxsimd-dev
  llvm-11 llvm-11-dev llvm-11-linker-tools llvm-11-runtime llvm-11-tools numba-doc python-babel-localedata python-matplotlib-data python-odf-doc
  python-odf-tools python-tables-data python3-appdirs python3-babel python3-beniget python3-blosc python3-bottleneck python3-brotli python3-cloudpickle
  python3-cycler python3-dask python3-decorator python3-distributed python3-dropbox python3-et-xmlfile python3-fonttools python3-fs python3-fusepy python3-gast
  python3-heapdict python3-jdcal python3-jinja2 python3-kiwisolver python3-libarchive-c python3-llvmlite python3-locket python3-lz4 python3-matplotlib
  python3-mpmath python3-msgpack python3-numba python3-numexpr python3-odf python3-openpyxl python3-pandas python3-pandas-lib python3-partd python3-pygit2
  python3-pythran python3-scipy python3-sortedcontainers python3-stone python3-sympy python3-tables python3-tables-lib python3-tblib python3-toolz
  python3-tornado python3-ufolib2 python3-unicodedata2 python3-xlwt python3-zict python3-zmq unicode-data
Suggested packages:
  libboost-doc libboost1.74-doc libboost-atomic1.74-dev libboost-chrono1.74-dev libboost-container1.74-dev libboost-context1.74-dev libboost-contract1.74-dev
  libboost-coroutine1.74-dev libboost-date-time1.74-dev libboost-exception1.74-dev libboost-fiber1.74-dev libboost-filesystem1.74-dev libboost-graph1.74-dev
  libboost-graph-parallel1.74-dev libboost-iostreams1.74-dev libboost-locale1.74-dev libboost-log1.74-dev libboost-math1.74-dev libboost-mpi1.74-dev
  libboost-mpi-python1.74-dev libboost-numpy1.74-dev libboost-program-options1.74-dev libboost-python1.74-dev libboost-random1.74-dev libboost-regex1.74-dev
  libboost-serialization1.74-dev libboost-stacktrace1.74-dev libboost-system1.74-dev libboost-test1.74-dev libboost-thread1.74-dev libboost-timer1.74-dev
  libboost-type-erasure1.74-dev libboost-wave1.74-dev libboost1.74-tools-dev libmpfrc++-dev libntl-dev libboost-nowide1.74-dev libjs-jquery-ui-docs libxsimd-doc
  llvm-11-doc python-blosc-doc python-bottleneck-doc python-cycler-doc ipython python-dask-doc python3-bcolz python3-graphviz python3-h5py python3-skimage
  python3-sklearn python3-sqlalchemy python-fsspec-doc python-jinja2-doc llvmlite-doc dvipng fonts-staypuft ipython3 python-matplotlib-doc python3-gobject
  python3-sip texlive-extra-utils texlive-latex-extra python-mpmath-doc python3-gmpy2 nvidia-cuda-toolkit python-pandas-doc python3-statsmodels
  python-pygit2-doc python-scipy-doc python-sortedcontainers-doc texlive-fonts-extra python-sympy-doc python3-netcdf4 python-tables-doc vitables
  python-toolz-doc python-tornado-doc python3-twisted python3-xlrd python-xlrt-doc
The following NEW packages will be installed:
  fonts-lyx libboost-dev libboost1.74-dev libclang-cpp11 libjs-jquery-ui liblbfgsb0 libllvm11 libopenblas-dev libopenblas-pthread-dev libtbb12 libxsimd-dev
  llvm-11 llvm-11-dev llvm-11-linker-tools llvm-11-runtime llvm-11-tools numba-doc python-babel-localedata python-matplotlib-data python-odf-doc
  python-odf-tools python-tables-data python3-appdirs python3-babel python3-beniget python3-blosc python3-bottleneck python3-brotli python3-cloudpickle
  python3-cycler python3-dask python3-decorator python3-distributed python3-dropbox python3-et-xmlfile python3-fonttools python3-fs python3-fsspec
  python3-fusepy python3-gast python3-heapdict python3-jdcal python3-jinja2 python3-kiwisolver python3-libarchive-c python3-llvmlite python3-locket python3-lz4
  python3-matplotlib python3-mpmath python3-msgpack python3-numba python3-numexpr python3-odf python3-openpyxl python3-pandas python3-pandas-lib python3-partd
  python3-pygit2 python3-pythran python3-scipy python3-sortedcontainers python3-stone python3-sympy python3-tables python3-tables-lib python3-tblib
  python3-toolz python3-tornado python3-ufolib2 python3-unicodedata2 python3-xlwt python3-zict python3-zmq unicode-data
0 upgraded, 75 newly installed, 0 to remove and 4 not upgraded.
Need to get 143 MB of archives.
After this operation, 883 MB of additional disk space will be used.

On the one hand, this means 75 new packages (per apt's own count), and those deps aren't trivial, including: llvm, libboost, matplotlib, scipy, pandas, openblas... I guess the good side is that performance is likely to be good, because there is a lot of leveraging of high-performance C libraries. There is also tornado? You need a web framework to use a file system? In total, it's pulling in 883 MB of dependencies.

Looking at RedHat 8, it does not show up at all... so installing on RedHat involves establishing a large venv (with a few GB of parallel environment for dependencies) or going with one of the Python distributions like Anaconda.

In a scientific Python context, this actually isn't bad... most of this stuff should be on hand anyway. Sarracenia is often deployed as a system tool, however, not an application, and when installing it at the system level, such large dependencies will give a lot of people pause.

petersilva commented 1 year ago

fwiw, I forked @tomkralidis 's metpx-cloud-publisher and made an sr3'ish version. I don't have an S3 bucket to play with easily, but it's still Tom's stuff, just re-cast in sr3.

https://github.com/petersilva/metpx-cloud-publisher/tree/sr3_version

You can clone it and try it out. It's 3 lines shorter... whoop dee doo... But I see there is no error handling, even in the old version... likely a concern eventually.

petersilva commented 1 year ago

Comparing with fsspec, a similar experiment pulling in boto3 required less than 60 MB of dependencies... which is already a lot. boto3 is what Tom's plugin needs.
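
For a sense of what the boto3 path involves, a rough sketch of an sr3 callback that uploads downloaded files with boto3 (this is not Tom's actual plugin; the class name and bucket are placeholders):

    import boto3
    from sarracenia.flowcb import FlowCB

    class S3Upload(FlowCB):
        """Hypothetical sr3 callback: after a file has been downloaded,
        push it to an S3 bucket with boto3."""

        def __init__(self, options):
            super().__init__(options)
            self.s3 = boto3.client("s3")

        def after_work(self, worklist):
            # worklist.ok holds the messages whose downloads succeeded
            for msg in worklist.ok:
                local = msg["new_dir"] + "/" + msg["new_file"]
                self.s3.upload_file(local, "some-bucket", msg["new_file"])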

petersilva commented 1 year ago

We also have a boto3 example for polling American RADAR data from AWS:

petersilva commented 1 year ago

I've just grasped @ymoisan 's deployment concept... (I'm slow... it takes a while ;-) If you are in a cloud container and don't have local file system access, then it doesn't matter much how you give the cloud container file access, so an sshfs or s3fs mounted in the container makes a lot of sense, and gives the same benefit as the through-memory case, without any need for code.

It's probably fine (for inside a cloud container), and if it ever isn't fine, we can revisit and still write the memory stuff later, when we know more.

petersilva commented 1 year ago

So there is still a cause for concern... which is how the container is shut down. sr3 traps signal 15 and tries hard to clean up properly before exiting. The container work I did last year with the WMO:

Why does the instances vs. container thing matter? Instances share state: a queue name is created in one instance and read by the others via a state file (~/.cache/sr3/subscribe/config/bla.qname). There is process management that uses pids, and sr3 status to understand how many instances are running. A typical sr3 pump will have hundreds of different configurations running, with multiple instances per configuration (on ddsr.cmc, I think, normally 900 or so processes running).

So all that stuff is broken/pointless if every process runs in its own container. In cases where there is a decent local file system available as a source or sink for data, containers make no sense and will just slow things down with added overhead.

In the pure cloud context, there is no local file system... so using drives to create a container-scoped fs makes much more sense. In the pure cloud context, sr3 status has no role, and sr3 has to figure out how to behave well as a managed container.

petersilva commented 1 year ago

Tom's thing was written for v2... It doesn't exit cleanly when the container is stopped, and the sr3 one won't either, because "foreground" is intended for debugging... where you don't care about data loss.

The proper way to do container shutdown is to run sr3 start xxxx and have something else as the anchor task. Then you have a signal handler... or shutdown handler (dunno how docker does such things) that calls sr3 stop xxxx, and then the container shutdown should be clean.
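
A sketch of that anchor-task idea as a Python container entrypoint (the configuration name is a placeholder; only the sr3 start/stop commands come from the comment above):

    #!/usr/bin/env python3
    # hypothetical entrypoint: start sr3, sit as the anchor task until
    # the container runtime sends SIGTERM, then stop sr3 cleanly.
    import signal
    import subprocess
    import threading

    CONFIG = "subscribe/myconfig"   # placeholder configuration name
    stopped = threading.Event()

    signal.signal(signal.SIGTERM, lambda sig, frame: stopped.set())
    signal.signal(signal.SIGINT, lambda sig, frame: stopped.set())

    subprocess.run(["sr3", "start", CONFIG], check=True)
    stopped.wait()                  # anchor task: just wait for the signal
    subprocess.run(["sr3", "stop", CONFIG], check=True)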

I pushed the change in logic to my branch.

ymoisan commented 1 year ago

Sorry I haven't chimed in before. So, about terminating containers: that will be the orchestrator's job. We will be deploying a Rancher instance on two nodes in Science in the near future. Rancher is really a Kubernetes wrapper that makes it easy to manage clusters of machines to run workloads. My thinking is that starting/stopping a container will start/stop sr_subscribe. If we need special shutdown instructions, that can be accommodated, I'm pretty sure. Passing credentials (e.g. ~/.aws-credentials) via K8s secrets, along with Sarracenia configuration parameters, to the container should allow us to tailor sarracenia subscriptions "easily". The devil hides in the details, of course.

An important item to consider IMO is that fsspec includes implementations for other cloud providers: Azure and Google Cloud (and probably others). My point is, if we can shove an s3:// URL into a sarracenia configuration and have it handled internally by fsspec/s3fs -- which is built on top of boto3 -- then we could do the same for uploads to other cloud storage, and even to a WAF we would have write access to.
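
A sketch of why that generalizes (the URLs and payload are placeholders; each scheme needs its backend package installed: s3fs for s3://, gcsfs for gcs://, adlfs for abfs://):

    import fsspec

    payload = b"...product bytes..."

    # the same code path works for any backend fsspec knows about;
    # only the URL scheme changes.
    for url in ("s3://bucket/product",
                "gcs://bucket/product",
                "abfs://container/product"):
        with fsspec.open(url, "wb") as f:
            f.write(payload)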