fair-research / bdbag

Big Data Bag Utilities
https://fair-research.org
Apache License 2.0

1.6.0 release branch. Includes fetch refactor to support extensible transports and other enhancements, bugfixes. #38

Closed mikedarcy closed 3 years ago

mikedarcy commented 4 years ago

This PR targets #37 and refactors the bdbag.fetch.transports module to support externally implemented transports that can be plugged in (i.e., declared as run-time imports) via the fetch_config configuration object in the bdbag.json config file.

This requires developers to perform the following integration tasks:

  1. Create a class deriving from bdbag.fetch.transports.base_transport.BaseFetchTransport and implement three required functions:

    • __init__(self, config, keychain, **kwargs): The class constructor. Derived classes should first call super(<derived class name>, self).__init__(config, keychain, **kwargs) which sets the config, keychain, and kwargs variables as class member variables with the same names.

    • fetch(self, url, output_path, **kwargs): This method should implement the logic required to transfer a file referenced by url to the local path referenced by output_path. The **kwargs argument is an extensible argument dictionary that the framework may populate with extra data. For example, an integer argument size may be present (if it can be found in fetch.txt for a given fetch entry), representing the expected size of the remote file in bytes.

    • cleanup(self): This method should implement any transport-specific release of resources. Note that this function will be called only once per transport, at the end of an entire bag fetch, and not once per file.
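As an illustration of step 1, here is a minimal sketch of a derived transport. The class name LocalCopyTransport, the files_copied counter, and the convention of returning output_path on success and None on failure are assumptions made for this example and are not taken from this PR. The fallback base class exists only so the sketch runs without bdbag installed; it mirrors the constructor behavior described above.

```python
import os
import shutil
from urllib.parse import urlsplit

try:
    from bdbag.fetch.transports.base_transport import BaseFetchTransport
except ImportError:
    # Stand-in (assumption) so the sketch runs without bdbag installed.
    # It only mirrors the constructor behavior described in this PR.
    class BaseFetchTransport(object):
        def __init__(self, config, keychain, **kwargs):
            self.config = config
            self.keychain = keychain
            self.kwargs = kwargs


class LocalCopyTransport(BaseFetchTransport):
    """Hypothetical transport that 'fetches' by copying a local file."""

    def __init__(self, config, keychain, **kwargs):
        # Let the base class store config, keychain, and kwargs first.
        super(LocalCopyTransport, self).__init__(config, keychain, **kwargs)
        self.files_copied = 0  # illustrative per-transport state

    def fetch(self, url, output_path, **kwargs):
        # Treat the path component of the URL as a local source file.
        source_path = urlsplit(url).path
        out_dir = os.path.dirname(output_path)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)
        shutil.copyfile(source_path, output_path)

        # "size" is optional; it is only checked when the caller supplies it.
        expected = kwargs.get("size")
        if expected is not None and os.path.getsize(output_path) != int(expected):
            os.remove(output_path)
            return None  # assumed failure convention for this sketch
        self.files_copied += 1
        return output_path

    def cleanup(self):
        # Called once at the end of the bag fetch; release per-transport state.
        self.files_copied = 0
```

A real transport would replace the file copy with its own protocol logic and would release connections or sessions in cleanup.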

  2. Configure the usage of the external transport via the fetch_config object of the bdbag.json configuration file. The fetch_config object is comprised of child configuration objects keyed by a lowercase string value representing the URL protocol scheme that is being configured. When configuring an external handler, the following applies:

    • There is a single required top-level string parameter with the key name handler, which maps to the fully-qualified name of the class implementing the required methods of the bdbag.fetch.transports.base_transport.BaseFetchTransport base class. At runtime, the bdbag fetch framework attempts to load this class via the importlib.import_module machinery. If loading succeeds, the class is instantiated and the instance is cached for the duration of the bag fetch operation. Subsequently, whenever a URL with a protocol scheme matching that of the installed handler is encountered in fetch.txt, that handler's fetch method is invoked.

    • There is also an optional string parameter, allow_keychain. It must be present and set to "True" to enable propagation of the bdbag keychain into the handler code during the __init__ call. If the allow_keychain parameter is missing, or is set to a value that does not evaluate to a Python boolean True, then the keychain argument passed to the __init__ call will be None. In general, if the custom handler code has its own mechanism for managing credentials, this parameter may be omitted. If the handler intends to use the bdbag keychain that is in context for the current user and fetch operation, this parameter must be present and evaluate to True.

    • The remainder of the protocol scheme handler configuration object can consist of any valid JSON; the entire object value assigned to the scheme key will be passed as the config parameter to the __init__ method of the custom handler.
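The loading behavior described in step 2 can be sketched roughly as follows. This is not the code from this PR: the function names resolve_handler_class, keychain_allowed, and create_handler are hypothetical, and the rule used to decide whether allow_keychain "evaluates to True" (the JSON boolean true or the string "true" in any letter case) is an assumption.

```python
import importlib


def resolve_handler_class(qualified_name):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(
            "expected 'package.module.ClassName', got %r" % qualified_name)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def keychain_allowed(scheme_config):
    """Assumed truthiness rule for the optional allow_keychain parameter."""
    value = scheme_config.get("allow_keychain", False)
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return value is True


def create_handler(scheme_config, keychain, **kwargs):
    """Instantiate the configured handler for one protocol scheme."""
    handler_class = resolve_handler_class(scheme_config["handler"])
    # The keychain is withheld unless the configuration opts in.
    effective_keychain = keychain if keychain_allowed(scheme_config) else None
    # The whole scheme object, including "handler", is passed as config.
    return handler_class(scheme_config, effective_keychain, **kwargs)
```

A caller would typically keep the returned instance in a dictionary keyed by scheme so that each handler is created once per bag fetch.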

For example, given the following fetch_config section:

{
    "fetch_config": {
        "s3": {
            "handler":"my.custom.S3Transport",
            "max_read_retries": 5,
            "read_chunk_size": 10485760,
            "read_timeout_seconds": 120
        },
        "foo": {
            "handler":"my.custom.FooTransport",
            "allow_keychain": true,
            "my_foo_complex_config": {
                "bar":[
                    "a","b","c"
                ],
                "baz":{
                    "xyz":123
                }
            }
        }
    }
}

For the scheme foo, the following object will be passed as the config parameter to the __init__ method of my.custom.FooTransport upon class instantiation:

{
    "handler":"my.custom.FooTransport",
    "allow_keychain": true,
    "my_foo_complex_config": {
        "bar":[
            "a","b","c"
        ],
        "baz":{
            "xyz":123
        }
    }
}
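
To make the scheme-keyed lookup concrete, the following sketch parses the example configuration and selects the object that would be handed to a handler for a given URL. The helper name config_for_url and the example URLs are hypothetical, and the lookup performed inside bdbag may differ in detail.

```python
import json
from urllib.parse import urlsplit

# The example fetch_config section from this description.
EXAMPLE_BDBAG_JSON = """
{
    "fetch_config": {
        "s3": {
            "handler": "my.custom.S3Transport",
            "max_read_retries": 5,
            "read_chunk_size": 10485760,
            "read_timeout_seconds": 120
        },
        "foo": {
            "handler": "my.custom.FooTransport",
            "allow_keychain": true,
            "my_foo_complex_config": {
                "bar": ["a", "b", "c"],
                "baz": {"xyz": 123}
            }
        }
    }
}
"""


def config_for_url(bdbag_config, url):
    """Return the scheme's configuration object, or None if unconfigured."""
    scheme = urlsplit(url).scheme.lower()
    return bdbag_config.get("fetch_config", {}).get(scheme)


bdbag_config = json.loads(EXAMPLE_BDBAG_JSON)
foo_config = config_for_url(bdbag_config, "foo://example.org/data/file.bin")
# foo_config now holds the full "foo" object, including the "handler" key.
```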
coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 269


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| bdbag/fetch/transports/base_transport.py | 8 | 9 | 88.89% |
| bdbag/bdbag_config.py | 30 | 32 | 93.75% |
| bdbag/fetch/transports/fetch_ftp.py | 47 | 51 | 92.16% |
| bdbag/fetch/transports/fetch_http.py | 140 | 148 | 94.59% |
| Total: | 347 | 362 | 95.86% |
Totals Coverage Status
Change from base Build 262: 3.1%
Covered Lines: 3233
Relevant Lines: 3360

💛 - Coveralls