GPlates / gplately

GPlately is a Python package to interrogate tectonic plate reconstructions.
https://gplates.github.io/gplately/
GNU General Public License v2.0

need a new way to download and cache files from a server #91

Closed michaelchin closed 10 months ago

michaelchin commented 1 year ago

Two problems to solve: downloading files efficiently, and keeping the local cache up to date.

Check if the mod_deflate module has been enabled on the server, so that files are served compressed.

Use multithreading to download files. See the link below. https://www.geeksforgeeks.org/simple-multithreaded-download-manager-in-python/
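Something along these lines could work (just a sketch using requests and ThreadPoolExecutor; the function names and arguments are made up for illustration):

import os
import requests
from concurrent.futures import ThreadPoolExecutor

def download_file(url, dest_dir):
    # Stream the response to disk so large rasters are not held in memory.
    local_path = os.path.join(dest_dir, os.path.basename(url))
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
    return local_path

def download_all(urls, dest_dir, max_workers=4):
    # Download several files concurrently, e.g. a batch of age grids.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda u: download_file(u, dest_dir), urls))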

Send the ETag along in an If-None-Match header field so the data server only returns the file when a newer version exists.
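A rough sketch of the conditional request (where the cached ETag comes from and where it is stored is left open here):

import requests

def fetch_if_changed(url, cached_etag=None):
    # Ask the server to return the file only if its ETag differs from ours.
    headers = {"If-None-Match": cached_etag} if cached_etag else {}
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None, cached_etag          # cached copy is still current
    response.raise_for_status()
    return response.content, response.headers.get("ETag")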

Cache-Control and HTTP Expires headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expires
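A sketch of how those headers could feed into the cache logic (only the header parsing is shown; the surrounding cache code is assumed):

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def seconds_until_stale(headers):
    # Prefer Cache-Control: max-age, fall back to the Expires header.
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            return int(directive.split("=", 1)[1])
    expires = headers.get("Expires")
    if expires:
        delta = parsedate_to_datetime(expires) - datetime.now(timezone.utc)
        return max(int(delta.total_seconds()), 0)
    return 0  # no caching information: always re-check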

Also need to consider the web-server caching problem. Even if you have updated the resources on your server, there might be copies of old versions cached along the network path; for example, the Sydney Uni proxy server might decide to cache your files for a certain period of time. This is out of our control unless we also change the resource's URI, e.g. the file name.

Use https://pypi.org/project/platformdirs/ to get the cache dir.
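e.g. (the "gplately" app name here is just an example):

from platformdirs import user_cache_dir

# Platform-appropriate per-user cache directory for downloaded files.
cache_dir = user_cache_dir("gplately")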

In the future, maybe consider using GeoServer to serve geospatial data and rasters.

brmather commented 1 year ago

We need a new way to download and cache files from a server. What I think we need is to provide users of GPlately with a JSON file that contains, for each plate model, a URL to download it from, an MD5 hash, and the local directory where it is stored on the machine. The same level of detail would also apply to any age grids or spreading-rate grids attached to a reconstruction.

e.g.

JSON file:
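A hypothetical entry could look something like this (URLs, hashes and paths are made up for illustration):

{
  "Muller2019": {
    "plate_model": {
      "url": "https://www.example.org/models/Muller2019.zip",
      "md5": "0123456789abcdef0123456789abcdef",
      "local_dir": "plate_models/Muller2019"
    },
    "age_grids": {
      "url": "https://www.example.org/grids/Muller2019_agegrids.zip",
      "md5": "fedcba9876543210fedcba9876543210",
      "local_dir": "grids/Muller2019/age"
    }
  }
}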

The MD5 hash is useful because, if there is a mismatch between the MD5 hash of the file on the local machine and the one stored in the JSON file (due to an updated version), the file can be automatically re-downloaded. I'm not sure if this is the most elegant solution; it is better than the current one, but I'm open to other ideas. I also thought it would be good to have a script (or notebook) that assembles the JSON file to be shipped with GPlately (or hosted on GitHub Pages using GitHub Actions), so that any new downloadable file (plate models, age grids, etc.) can easily be added.
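A minimal sketch of that hash check, assuming the catalog has been loaded into a dict whose entries carry an "md5" field (names are illustrative only):

import hashlib
import os

def md5_of_file(path):
    # Hash the file in chunks so large grids don't need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_download(entry, local_path):
    # Re-download if the file is missing or its hash differs from the catalog.
    if not os.path.exists(local_path):
        return True
    return md5_of_file(local_path) != entry["md5"]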

The current implementation is housed within gplately.download and the data files are hard-coded in gplately.data. It uses pooch to download files, which is fine for single files but quite slow when multiple files need to be downloaded at once, e.g. age grids.

michaelchin commented 1 year ago

It is better than the current solution, but I’m open to other ideas.

Can anyone briefly explain what the problem with the current solution is? I saw in the source code that we are using ETags, which should work if implemented properly. I have not looked deeply into the source code. Maybe someone can briefly explain the current solution (within 100 words)? Thanks @lauren-ilano @laurilano

michaelchin commented 1 year ago

What I think we need is to provide users of GPlately with a JSON file

Let's call this JSON file the "data catalog" for now. I think we should put this "catalog" on the data server. A data.DataCollection object should be created from this "catalog" file. We should attach "ETag" and "Expires" properties to this "catalog" file and fetch the latest version from the data server accordingly. We also need to allow users to reload the "catalog" explicitly.
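A rough sketch of the idea (the catalog URL, class layout and file paths are all hypothetical, not a proposal for the final API):

import json
import os
import requests

CATALOG_URL = "https://www.example.org/gplately/data_catalog.json"  # hypothetical

class DataCollection:
    # Hypothetical loader that keeps a local copy of the catalog plus its ETag.
    def __init__(self, cache_dir):
        self.catalog_path = os.path.join(cache_dir, "data_catalog.json")
        self.etag_path = self.catalog_path + ".etag"
        self.catalog = None

    def reload(self):
        # Users can call this explicitly; it only downloads when the ETag changed.
        etag = None
        if os.path.exists(self.etag_path):
            with open(self.etag_path) as f:
                etag = f.read().strip()
        headers = {"If-None-Match": etag} if etag else {}
        response = requests.get(CATALOG_URL, headers=headers, timeout=30)
        if response.status_code != 304:
            response.raise_for_status()
            with open(self.catalog_path, "wb") as f:
                f.write(response.content)
            new_etag = response.headers.get("ETag")
            if new_etag:
                with open(self.etag_path, "w") as f:
                    f.write(new_etag)
        with open(self.catalog_path) as f:
            self.catalog = json.load(f)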

brmather commented 1 year ago

Thanks @michaelchin for giving this more thought. Hopefully @lauren-ilano @laurilano will soon provide a summary of the current system and cross-reference some of the key functions in this issue tracker.

I take your point about assigning ETag and Expires properties to the "data catalogue". How regularly these data catalogues expire is up for debate. Some of the plate models we ship are no longer actively developed and so would never expire; conversely, there might be an unexpected bug in an existing plate model that would need to be patched and shipped. Do you propose that gplately occasionally check the ETag of the remote data catalogue (depending on its expiry date) and download it if the ETag has changed?

The next step would then be to work out which data in the data catalogue have changed. We only want to download files that have been explicitly requested by the user (e.g. DataServer("Muller2019") downloads the Muller et al. 2019 plate model), and they should only be re-downloaded if a newer version exists on the server. The data catalogue would need to keep track of this information.

laurilano commented 1 year ago

Hi @michaelchin! Currently, DataServer sends a web request to check the ETag of the URL every time the user requests a set of files from WebDAV. Common GPlately workflows do this twice (two web requests): once to get files for the PlateReconstruction object, and once to get files for the PlotTopologies object. ETags are currently kept as strings in separate txt files, one per DataServer file, with the MD5 hash of the URL as the filename. Current gplately caches look a bit messy:

(screenshot of the cache directory)

Overall, sending multiple web requests to compare a URL's ETag with the ETag in the txt file is quite slow. The goal is to implement a system that compares file hashes as rarely as possible, with minimal user involvement.

michaelchin commented 1 year ago


Thanks @laurilano . Is there a test case which I can run to see how slow the process is? I need to benchmark the speed/performance so that we can compare the results with the new code. Thanks.

michaelchin commented 1 year ago

@brmather @laurilano

Here are some rough ideas

  1. make sure the mod_deflate module has been enabled on the server
  2. use the ETag and Expires headers to keep the cache up to date
  3. data catalog
  4. multithreading

I will create a new issue for each work item, and then we can discuss and track progress.

brmather commented 1 year ago

Hi @michaelchin

Here is a code snippet you can use to benchmark:

import gplately

# Download the Muller et al. 2019 plate model files from the data server.
gdownload = gplately.download.DataServer("Muller2019")
rotation_model, topology_features, static_polygons = gdownload.get_plate_reconstruction_files()
coastlines, continents, COBs = gdownload.get_topology_geometries()
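For timing, the same calls can simply be wrapped with a timer, e.g.:

import time
import gplately

start = time.perf_counter()
gdownload = gplately.download.DataServer("Muller2019")
rotation_model, topology_features, static_polygons = gdownload.get_plate_reconstruction_files()
coastlines, continents, COBs = gdownload.get_topology_geometries()
print(f"Downloads took {time.perf_counter() - start:.1f} s")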

Your ideas look good. Re the data catalog: it would be good to ensure users have a convenient way to add new data products to the data catalog, e.g. rasters, reconstructions, etc. (preferably via GitHub). I'm not sure what that would look like.

brmather commented 1 year ago

Some thoughts on the DataServer object... It might be better to return an object for each plate reconstruction that has attributes for rotation_model, topology_features, etc.

e.g.

plate_model = gplately.download.get_model_muller2019()
plate_model.rotation_model # rotation model is accessed as an attribute
plate_model.get_age_grids(list_of_times) # download multiple age grids in parallel

It's only semantically different, but I think it is more user-friendly. Any thoughts @michaelchin @laurilano @lauren-ilano ?

jcannon-gplates commented 1 year ago

plate_model = gplately.download.get_model_muller2019()
plate_model.rotation_model # rotation model is accessed as an attribute
plate_model.get_age_grids(list_of_times) # download multiple age grids in parallel

I like this idea from a user/API perspective! (as an alternative to returning tuples like rotation_model, topology_features, static_polygons)

laurilano commented 1 year ago


Hi Ben, this definitely looks more user-friendly! Do you think it would be preferable to keep the existing process that semi-automatically identifies relevant plate model files using the strings to include/strings to ignore functions (e.g. https://github.com/GPlates/gplately/blob/master/gplately/data.py#LL202C9-L202C43), or would it be better for us to manually categorise the plate model files under rotation models, topologies, static polygons, continents etc. for each function, e.g. get_model_muller2019()?

michaelchin commented 1 year ago

Do you think it would be preferable to keep the existing process that semi-automatically identifies relevant plate model files using the strings to include/strings to ignore functions (e.g. https://github.com/GPlates/gplately/blob/master/gplately/data.py#LL202C9-L202C43), or would it be better for us to manually categorise the plate model files under rotation models, topologies, static polygons, continents etc. for each function, e.g. get_model_muller2019()?

We should define a class called RotationModel. The RotationModel should contain the rotation files and a set of layers, such as age grids, coastlines, static polygons, COBs, etc. We should use a layer_name string to identify each layer, and each layer should contain the URL or file path of its files.

brmather commented 1 year ago

@michaelchin I think @laurilano is referring to the zip files that contain the plate reconstruction files. Some objects (e.g. topology_features) consist of multiple files. Currently there is some basic logic built into the DataServer object to sift through the unzipped folders and identify all .rot files as rotations, and .gpml files as topologies, coastlines, COBs, etc., depending on string matching. It's currently a bit messy, so I would prefer the locations of these files to be hard-coded in the data catalog.

We should define a class called RotationModel. The RotationModel should contain the rotation files and a set of layers, such as age grids, coastlines, static polygons, COBs, etc. We should use a layer_name string to identify each layer, and each layer should contain the URL or file path of its files.

I agree, but we shouldn't call the class RotationModel because rotation models are just a subset of the files bundled in a plate reconstruction. We ought to call it PlateModel instead.
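A rough sketch of how a PlateModel class along those lines might look (the layer names and methods are illustrative only, not an agreed API):

class PlateModel:
    # Hypothetical container for one plate reconstruction and its layers.
    # "layers" maps a layer name to the URL(s) or local path(s) of its files,
    # e.g. {"rotations": [...], "topologies": [...], "static_polygons": [...],
    #       "coastlines": [...], "COBs": [...], "age_grids": "..."}.
    def __init__(self, name, layers):
        self.name = name
        self.layers = layers

    @property
    def rotation_model(self):
        # Rotation files are accessed as an attribute rather than a tuple element.
        return self.layers["rotations"]

    def get_layer(self, layer_name):
        return self.layers[layer_name]

    def get_age_grids(self, times):
        # Placeholder: would download the requested age grids in parallel,
        # e.g. with a threaded downloader like the one sketched earlier.
        raise NotImplementedError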