Microservice to manage the downloads of biomaj.
A protobuf interface is available in biomaj_download/message/downmessage_pb2.py
to exchange messages between BioMAJ and the download service.
Messages go through RabbitMQ, which must be installed separately.
Only Python 3 is supported; Python 2 support has been dropped.
If you change the protobuf code, you need to compile it to Python code:
cd /tmp/protoc/
PB_REL="https://github.com/protocolbuffers/protobuf/releases"
curl -LO $PB_REL/download/v23.2/protoc-23.2-linux-x86_64.zip # Version used by GitHub Actions currently
unzip protoc-23.2-linux-x86_64.zip
cd ..../biomaj_download/message/
/tmp/protoc/bin/protoc --python_out=. downmessage.proto
To run the test suite, use:
LOCAL_IRODS=0 pytest -v tests/biomaj_tests.py
This command skips the tests that need a local iRODS server.
Some tests might fail because of network connection problems. You can skip them with:
NETWORK=0 pytest -v tests/biomaj_tests.py
export BIOMAJ_CONFIG=path_to_config.yml
python bin/biomaj_download_consumer.py
If the package is installed via pip, you need a file named gunicorn_conf.py somewhere on the local server, containing:
def worker_exit(server, worker):
    from prometheus_client import multiprocess
    multiprocess.mark_process_dead(worker.pid)
If you cloned the repository and installed it via python setup.py install, just refer to the gunicorn_conf.py in the cloned repository.
export BIOMAJ_CONFIG=path_to_config.yml
rm -rf ..path_to/prometheus-multiproc
mkdir -p ..path_to/prometheus-multiproc
export prometheus_multiproc_dir=..path_to/prometheus-multiproc
gunicorn -c gunicorn_conf.py biomaj_download.biomaj_download_web:app
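For deployments driven from a Python launcher script, the directory preparation done by the shell commands above can be sketched as follows. The helper name reset_multiproc_dir and the temporary demo path are illustrative assumptions, not part of biomaj-download; only the environment variable name comes from the commands above.

```python
import os
import shutil
import tempfile


def reset_multiproc_dir(path):
    """Remove any stale metrics directory, recreate it, and export its path.

    Hypothetical helper: it mirrors the rm/mkdir/export shell steps.
    """
    shutil.rmtree(path, ignore_errors=True)  # same effect as: rm -rf path
    os.makedirs(path, exist_ok=True)         # same effect as: mkdir -p path
    # Variable name taken from the shell commands above.
    os.environ["prometheus_multiproc_dir"] = path
    return path


# Demonstration on a throwaway directory rather than a real deployment path.
demo_dir = os.path.join(tempfile.mkdtemp(), "prometheus-multiproc")
reset_multiproc_dir(demo_dir)
```

Note that os.environ only affects the current process and the processes it starts, so gunicorn would have to be launched from the same script for the variable to be visible.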
Web processes should run behind a proxy or load balancer. The API base URL is /api/download.
Prometheus metrics are exposed at /metrics on the web server.
A common problem when downloading a large number of files is the handling of temporary failures (network issues, server too busy to answer, etc.).
Since version 3.1.2, biomaj-download uses the Tenacity library, which is designed to handle this.
This mechanism is configurable through 2 downloader-specific options (see Download options): stop_condition and wait_policy.
When working on Python code, you can pass instances of Tenacity's stop_base and wait_base respectively.
This includes classes defined in Tenacity or your own derived classes.
For bank configuration, those options can also be given as strings read from the configuration file. This parsing is based on the Simple Eval library. The rules are straightforward:
Stop and wait classes defined in Tenacity (i.e. classes derived from stop_base and wait_base respectively) can be used by calling their constructor with the expected parameters. For example, the string "stop_after_attempt(5)" will create the desired object. Note that stop and wait classes that need no argument must be used as constants (i.e. use "stop_never" and not "stop_never()"). Currently, this is the case for "stop_never" (as in Tenacity) and "wait_none" (this slightly differs from Tenacity where it is "wait_none()").
Classes that combine other stop conditions (namely stop_all and stop_any) or wait policies (namely wait_combine) can be used.
The + operator can be used to add wait policies (similar to wait_combine).
The & and | operators can be used to compose stop conditions (similar to stop_all and stop_any respectively).
However, in this case, you can't use your own conditions. The complete list of stop conditions is:
stop_never (although its use is discouraged)
stop_after_attempt
stop_after_delay
stop_when_event_set
stop_all
stop_any
The complete list of wait policies is:
wait_none
wait_fixed
wait_random
wait_incrementing
wait_exponential
wait_random_exponential
wait_combine
wait_chain
Please refer to Tenacity doc for their meaning and their parameters.
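To make the role of the +, & and | operators more concrete, here is a toy model in plain Python. It is only a sketch of the idea of composing conditions through operator overloading: the class names (StopAfterAttempt, WaitFixed, etc.) and call signatures are hypothetical and do not reproduce Tenacity's implementation.

```python
class Stop:
    """Toy stop condition, called with (attempt_number, elapsed_seconds)."""

    def __and__(self, other):
        return StopAll(self, other)   # stop only when both conditions agree

    def __or__(self, other):
        return StopAny(self, other)   # stop as soon as one condition agrees


class StopAfterAttempt(Stop):
    def __init__(self, max_attempts):
        self.max_attempts = max_attempts

    def __call__(self, attempt, elapsed):
        return attempt >= self.max_attempts


class StopAfterDelay(Stop):
    def __init__(self, max_delay):
        self.max_delay = max_delay

    def __call__(self, attempt, elapsed):
        return elapsed >= self.max_delay


class StopAll(Stop):
    def __init__(self, *stops):
        self.stops = stops

    def __call__(self, attempt, elapsed):
        return all(stop(attempt, elapsed) for stop in self.stops)


class StopAny(Stop):
    def __init__(self, *stops):
        self.stops = stops

    def __call__(self, attempt, elapsed):
        return any(stop(attempt, elapsed) for stop in self.stops)


class Wait:
    """Toy wait policy, called with the attempt number, returns seconds."""

    def __add__(self, other):
        return WaitCombine(self, other)   # delays are summed


class WaitFixed(Wait):
    def __init__(self, seconds):
        self.seconds = seconds

    def __call__(self, attempt):
        return self.seconds


class WaitCombine(Wait):
    def __init__(self, *waits):
        self.waits = waits

    def __call__(self, attempt):
        return sum(wait(attempt) for wait in self.waits)


either = StopAfterAttempt(5) | StopAfterDelay(60)   # 5 attempts or 60 seconds
both = StopAfterAttempt(5) & StopAfterDelay(60)     # 5 attempts and 60 seconds
total = WaitFixed(3) + WaitFixed(2)                 # 3 seconds plus 2 seconds
```

In this model, either(5, 10.0) is true because the attempt limit is reached, while both(5, 10.0) is false because the delay limit is not.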
Examples (inspired by Tenacity doc):
"wait_fixed(3) + wait_random(0, 2)" and "wait_combine(wait_fixed(3), wait_random(0, 2))" are equivalent and will wait 3 seconds plus up to 2 seconds of random delay.
"wait_chain(*([wait_fixed(3) for i in range(3)] + [wait_fixed(7) for i in range(2)] + [wait_fixed(9)]))" will wait 3s for 3 attempts, 7s for the next 2 attempts and 9s for all attempts thereafter (here + is the list concatenation).
"wait_none + wait_random(1,2)" will wait between 1s and 2s (since wait_none doesn't wait).
"stop_never | stop_after_attempt(5)" will stop after 5 attempts (since stop_never never stops).
Note that some protocols (e.g. FTP) classify errors as temporary or permanent (for example, trying to download a nonexistent file). More generally, we could distinguish permanent errors based on error codes, etc. and not retry in this case. However, in our experience, so-called permanent errors may well be temporary. Therefore downloaders always retry, whatever the error. In some cases this is a waste of time, but generally it is worth it.
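The "always retry, whatever the error" behaviour can be sketched with a small loop in plain Python. This is an illustration of the principle only, not the code used by biomaj-download (which relies on Tenacity); the function retry, its parameters and the flaky_download example are hypothetical.

```python
import time


def retry(action, should_stop, wait_seconds, sleep=time.sleep):
    """Call action() until it succeeds or should_stop(attempt) is true.

    should_stop and wait_seconds both receive the attempt number (from 1).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return action()
        except Exception:
            # Every error is treated as temporary, even "permanent" ones.
            if should_stop(attempt):
                raise
            sleep(wait_seconds(attempt))


calls = []


def flaky_download():
    """Fail twice, then succeed, to simulate a temporary failure."""
    calls.append(1)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "file.txt"


waits = []  # record the delays instead of really sleeping
result = retry(
    flaky_download,
    should_stop=lambda attempt: attempt >= 5,  # like stop_after_attempt(5)
    wait_seconds=lambda attempt: 3,            # like wait_fixed(3)
    sleep=waits.append,
)
```

Here the third call succeeds, so the loop waits twice and returns without reaching the limit of 5 attempts.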
When using the sftp protocol, biomaj-download must check the host key. Those keys are stored in a file (for instance ~/.ssh/known_hosts).
Two options are available to configure this: one sets the host key file and the other, ssh_new_host, sets what to do with hosts that are not in the file.
When the host and the key are found in the file, the connection is accepted. If the host is found but the key does not match, the connection is rejected (this usually indicates a problem or a change of configuration on the remote server). When the host is not found, the decision depends on the value of ssh_new_host:
reject means that the connection is rejected
accept means that the connection is accepted
add means that the connection is accepted and the key is added to the file
See the description of the options in Download options.
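The decision rules for host keys can be summarised as a small function. This is a simplified sketch: the function name check_host_key is hypothetical, and a plain dictionary stands in for the host key file (a real known_hosts file can hold several keys per host and hashed host names).

```python
def check_host_key(host, key, known_hosts, ssh_new_host="reject"):
    """Return True to accept the connection, False to reject it.

    known_hosts maps host names to keys and stands in for the host key
    file; it is updated in place when ssh_new_host is "add".
    """
    if host in known_hosts:
        # Known host: accept only if the key matches.
        return known_hosts[host] == key
    if ssh_new_host == "reject":
        return False
    if ssh_new_host == "accept":
        return True
    if ssh_new_host == "add":
        known_hosts[host] = key
        return True
    raise ValueError("unknown ssh_new_host value: %r" % (ssh_new_host,))


known = {"sftp.example.org": "key-A"}
matching = check_host_key("sftp.example.org", "key-A", known)
mismatch = check_host_key("sftp.example.org", "key-B", known, "add")
unknown_rejected = check_host_key("new.example.org", "key-C", known, "reject")
unknown_added = check_host_key("new.example.org", "key-C", known, "add")
```

A key mismatch on a known host is rejected whatever the value of ssh_new_host, since that option only applies to hosts missing from the file.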
Since version 3.0.26, you can use the set_options method to pass a dictionary of downloader-specific options.
The following list shows some options and their effect (the option to set is the key and the parameter is the associated value):
stop_condition. Parameter: an instance of Tenacity's stop_base or a string (see Retrying). Downloaders: all (except LocalDownload). Default: stop_after_attempt(3) (i.e. stop after 3 attempts).
wait_policy. Parameter: an instance of Tenacity's wait_base or a string (see Retrying). Downloaders: all (except LocalDownload). Default: wait_fixed(3) (i.e. wait 3 seconds between attempts).
Downloaders: all (except LocalDownload).
Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload).
Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload).
Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload).
Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload).
Parameter: one of default, multicwd, nocwd, singlecwd (case insensitive). Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload), only used for FTP(S). Effect: nocwd and singlecwd are usually faster but not always supported. Default: default (which is multicwd at the time of this writing, as in cURL).
Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload), only used for SFTP. Effect: sets the file used to check the host key for SFTP. Default: ~/.ssh/known_hosts (where ~ is the home directory of the current user).
ssh_new_host. Parameter: one of reject, accept, add. Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload), only used for SFTP. Default: reject (i.e. refuse new hosts; you must add them to the file, for instance with ssh or sftp).
Downloaders: CurlDownload (and derived classes: DirectFTPDownload, DirectHTTPDownload), only used for HTTP(S). Effect: sets the policy for HTTP redirections. Default: true (i.e. follow redirections).
Those options can be set in bank properties.
See file global.properties.example in biomaj module.
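As an illustration, a dictionary of options suitable for set_options could be built as follows. Only the option names stop_condition, wait_policy and ssh_new_host, their default values and the string syntax come from this document; the helper build_options and its validation are hypothetical, and the call to set_options is left as a comment because it needs an existing downloader instance.

```python
ALLOWED_NEW_HOST = ("reject", "accept", "add")


def build_options(stop_condition="stop_after_attempt(3)",
                  wait_policy="wait_fixed(3)",
                  ssh_new_host="reject"):
    """Build a dictionary of downloader-specific options.

    Hypothetical helper: keys are option names, values are parameters.
    The defaults are the ones documented above.
    """
    if ssh_new_host not in ALLOWED_NEW_HOST:
        raise ValueError("ssh_new_host must be one of %s" % (ALLOWED_NEW_HOST,))
    return {
        "stop_condition": stop_condition,
        "wait_policy": wait_policy,
        "ssh_new_host": ssh_new_host,
    }


options = build_options(
    stop_condition="stop_after_attempt(5)",
    wait_policy="wait_fixed(3) + wait_random(0, 2)",
    ssh_new_host="add",
)
# With an existing downloader instance (not created here):
# downloader.set_options(options)
```

The stop and wait values are given as strings here; in Python code, instances of Tenacity's stop and wait classes could be passed instead.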