antarctica / bas-apres

Package for working with BAS ApRES files
Apache License 2.0

adding capability to load from cloud storage #1

Open jkingslake opened 2 weeks ago

jkingslake commented 2 weeks ago

Hello,

Thanks for creating and sharing a great tool!

I have been working with large ApRES data sets stored in google cloud buckets.

Is there any interest in adding the capability to load ApRES from google buckets (and other cloud storage locations) to bas-apres?

I have hacked together something on a fork that seems to work, and which loads ApRES data I have in a google bucket as follows:

import sys

# use the forked bas-apres package from a local checkout
sys.path.append("../../bas-apres/")
import apres as ap

remote_dat_filename = 'gs://ldeo-glaciology/apres/thwaites/continuous/ApRES_LTG/SD1/DIR2023-01-25-0435/DATA2023-02-16-0437.DAT'

with ap.ApRESFile(remote_dat_filename) as f:
    f.read()

There is probably a much more elegant way of doing this, but so far it seems like it's working well enough for my purposes.

Any interest in working on a PR together on this? I would love to incorporate it into the package if you think it could be useful to people.

Some background:

I am planning on using the package as part of a different package we have been developing called xapres, which is for loading and processing ApRES data using xarray. Currently the loading is done in an inefficient way, and incorporating bas-apres would speed things up a lot. However, I load from google buckets a lot, so I would need this capability for that integration to happen.

paul-breen commented 1 week ago

Hi @jkingslake,

This is a nice idea. A few immediate questions I have:

- How stable is the GCS package and its API?
- Presumably this requires some kind of authentication. How are you handling credentials etc.?
- Is this something that could be easily extended to support other cloud storage platforms?
- In the interests of keeping the package's dependencies fairly mainstream, I wonder if this should be implemented as a sub-package.

jkingslake commented 1 week ago

Great!

How stable is the GCS package and its API

I know gcsfs is used quite widely, but I really don't know how stable it is, to be honest. It is built on fsspec, which is more general and probably more stable. gcsfs is just a google-specific way of calling fsspec, as far as I understand, so maybe the best thing to do is to implement this functionality with fsspec instead. I just used gcsfs because I have stored our data in google cloud storage.
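
For example, reading the same file via fsspec directly would look something like this (a minimal sketch; it reuses the remote_dat_filename path from my snippet above, and no credentials are needed because the bucket is public):

import fsspec

# remote_dat_filename is the gs:// path defined in the snippet above
with fsspec.open(remote_dat_filename, 'rb') as f:
    raw_bytes = f.read()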

Presumably this requires some kind of authentication. How are you handling credentials etc.

All ApRES data I have stored in google cloud storage is completely open to the public, so no authentication is needed to access it. To write anything to a bucket would require authentication. I have always done that with a token, as shown here: https://ldeo-glaciology.github.io/glaciology-intro-book/sections/appendix/upload_Measures_data_to_bucket.html#write-to-bucket
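
For reference, a minimal sketch of that token-based approach for writing (the bucket path and token file here are hypothetical placeholders):

import gcsfs

# authenticate with a service-account token (JSON keyfile); only needed for writing
fs = gcsfs.GCSFileSystem(token='path/to/token.json')

data = b'example bytes to upload'
with fs.open('gs://my-bucket/some/output/file.nc', 'wb') as f:
    f.write(data)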

Is this something that could be easily extended to support other cloud storage platforms

Yes, I believe so, using fsspec (see above). Although I have not often used other cloud storage platforms, I think s3 works in a very similar way, and you could probably deal with any of the main platforms using fsspec - I would need to look into it more.
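
For instance, fsspec can work out the backend from the URL scheme, so in principle the same code path could serve gs://, s3:// and local files alike (a sketch; actually opening the gs:// or s3:// paths would additionally need gcsfs or s3fs installed):

import fsspec

for path in ['gs://bucket/file.dat', 's3://bucket/file.dat', '/local/file.dat']:
    protocol = fsspec.utils.get_protocol(path)  # 'gs', 's3' or 'file'
    print(path, '->', protocol)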

In the interests of keeping the package's dependencies fairly mainstream, I wonder if this should be implemented as a sub-package

I can see the desire to keep the dependencies minimal. Is another option to have fsspec as an optional dependency that would only get imported if you try to read from cloud storage?

paul-breen commented 1 week ago

Thanks for the comments, that's really useful. So it seems that fsspec might be the way to go. I think a working assumption that any remote files are publicly accessible (read-only) is reasonable. Regarding dependencies - it's fine to only import if trying to read from cloud storage, but that would mean the dependency itself needs to be optional, hence suggesting a sub-package. Maybe a better way to handle this, though, is by declaring fsspec to be an optional dependency of the package (an extra). That would probably be much easier to manage, and it is then up to the user to install it along with the core package if they require that functionality.

jkingslake commented 1 week ago

OK, sounds good. So just to be clear, should I put import fsspec inside the if statement where we check whether it's a remote load (here)?

paul-breen commented 1 week ago

Yes, the import would need to be inside the if statement. Then the fsspec package would need to be made an optional dependency of the bas-apres package in its pyproject.toml file: https://python-poetry.org/docs/pyproject/#extras.

I would suggest that the import should also be wrapped in a try/except block and if an ImportError is raised, then print out a suggestion to the user to install this optional dependency. For example, if the extra is named remote, say, then something like:

To open files on remote storage the 'remote' optional dependency must be installed: pip install bas-apres[remote]

We'd have something like this:

        if isinstance(self.fp, gcsfs.core.GCSFile):
            # fp is a remote (GCS) file object, so the optional fsspec dependency is required
            try:
                import fsspec
            except ImportError:
                raise ImportError("To open files on remote storage the 'remote' optional dependency must be installed: pip install bas-apres[remote]")

            self.remote = True
        else:
            self.remote = False

jkingslake commented 1 week ago

I pushed these changes to the remote_load branch in my fork: https://github.com/jkingslake/bas-apres/tree/remote_loading

It now:

Let me know what you think.

paul-breen commented 1 week ago

Thanks, this looks really good.

A question: have you tried the CLI commands with remote files, particularly those that write a new file? For instance, by default apres_to_nc will write the output of the conversion to netCDF to a file with the same path as the input file, but with a .nc suffix. Does this still work when the input is a remote file (you'd need write access)? This isn't a showstopper if not, as you can optionally provide an explicit output filename, which could be local, but I'm interested to know how/if this works.

jkingslake commented 5 days ago

apres_to_nc works for loading from a file in a remote bucket and writing it to a local netCDF.

I haven't got it working to write back to the remote location. This would require some modification to to_netcdf to allow it to open and write to a bucket object (which might differ between cloud services, e.g. s3 vs gcs). We would also need to allow the user to pass a token.

I think this is doable, but I currently don't fully understand to_netcdf, so it would take me a bit more digging.

I also think it is not necessarily a common use case. In my workflow I have taken data files from cloud storage and either used them locally in memory, or written them to zarr in cloud storage (via xarray).
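
For context, the zarr-writing step in that workflow looks roughly like this (a sketch with a hypothetical bucket path and token, and a tiny placeholder dataset standing in for the real ApRES data):

import gcsfs
import xarray as xr

# placeholder dataset standing in for the processed ApRES data
ds = xr.Dataset({'chirp': ('time', [0.0, 1.0, 2.0])})

# authenticate, then write the dataset to a zarr store in the bucket
fs = gcsfs.GCSFileSystem(token='path/to/token.json')
store = fs.get_mapper('gs://my-bucket/apres/processed.zarr')
ds.to_zarr(store, mode='w')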

Maybe a bigger project could be to add to bas-apres the capability to create xarrays; then writing them to zarr stores in cloud storage would be straightforward. This is actually a big part of what xapres is for. It creates xarrays with the following format:

[screenshot of the xarray dataset structure produced by xapres]

So I think that would be a good solution.

For now, what do you think about just allowing netCDFs to be written locally, even when they derive from remote .DAT files?
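
In other words, something like the following would be the supported pattern for now (a rough sketch only: the gs:// path is a hypothetical placeholder, and I'm assuming here that to_netcdf can be called on the file object with an explicit local output path):

import apres as ap

remote_dat_filename = 'gs://my-bucket/apres/DATA.DAT'  # hypothetical remote .DAT file

# read the remote file, then write the converted netCDF to a local path
with ap.ApRESFile(remote_dat_filename) as f:
    f.read()
    f.to_netcdf('DATA.nc')  # assumed: to_netcdf accepts an explicit local output filename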

paul-breen commented 3 days ago

Yes, writing netCDFs locally is fine. I was just wondering if your mods work with reading (and possibly writing) remote files via the CLI tools.

OK, can you do a PR of your changes against the original bas-apres project? That will make it easier for me to review and approve them. Thanks again.

jkingslake commented 3 days ago

Done: #2. Thanks for taking a look!