genouest / biomaj-download

Download microservice for BioMAJ
GNU Affero General Public License v3.0

FTP Timeout for one thread causes failure of whole download #11

Closed nsanilkumar-valluri closed 4 years ago

nsanilkumar-valluri commented 5 years ago

@osallou
I am trying to download a file of around 35 GB (specifically ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase//uniprot_trembl.fasta.gz) using BioMAJ. I configured 3 threads for the download of this file. One thread failed with [root][Thread-1] Could not get errcode:(28, 'FTP response timeout'). The remaining threads completed their downloads, but once everything finished, the workflow failed because of that one thread failure. 1) Is this expected behaviour? 2) Is there a configuration fix for this? Thanks for your help.

osallou commented 5 years ago

Hi, I do not understand. If you download a single file, only one thread will be used; the other threads will not be used.

If you download multiple files and one thread fails to fetch one file, the workflow will fail (but the downloaded files remain in the defined temp dir).

The workflow is OK only if all downloads are OK. In your case you get a timeout from the remote server. If this is a temporary FTP error and you ask BioMAJ to retry, it will download only the files that failed.

nsanilkumar-valluri commented 5 years ago

@osallou I get this error from the remote server, but it happens consistently for large files like uniprot_trembl.dat.gz (around 96 GB). Is there any setting we can override on our side?

osallou commented 5 years ago

A timeout is not related to file size if it is a server-side timeout (no response from the remote at connection time or during the download).

If the timeout is triggered by BioMAJ because the download of a file took too long (1 hour by default, I think), then you can increase it in the bank property file with

 timeout.download=xxxx

xxxx being the number of seconds for the timeout.

The default is defined in global.properties but can be overridden per bank.
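For instance, a bank properties file could raise the limit to two hours (the value here is illustrative, not a recommended default):

```properties
# Bank-level override of the global download timeout, in seconds
timeout.download=7200
```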

nsanilkumar-valluri commented 5 years ago

@osallou thanks for the response, and sorry for not explaining "consistent". I mean it happens every time I try to download the trembl file from that server. The error occurs at different points (sometimes after 20 minutes, sometimes 40), but the download never completes. timeout.download extends the allowed download time, but I don't think this is about download time; it seems to be about inactivity from the FTP server. I suspected a curl limitation when downloading through stateful servers/NAT; curl has a parameter for this, CURLOPT_FTP_RESPONSE_TIMEOUT. Is there any field to override this in BioMAJ?

osallou commented 5 years ago

Those timeouts apply to establishing the session with the remote server, which is not your case if the failure happens at 20/40 minutes. It seems the download on the remote side is stalled/frozen for some reason. BioMAJ has no parameter for those, and I don't think curl has one for this case.

I do not face this issue locally, so maybe you have network issues on large downloads. Maybe your NAT/proxy cuts connections after some time?

osallou commented 5 years ago

You can try modifying ftp.py, changing:

curl.setopt(pycurl.CONNECTTIMEOUT, 300)

Change 300 to a larger value. But if the download has started and fails during the transfer, this will not help. And in that case it would always fail at 15 min.

nsanilkumar-valluri commented 5 years ago

Hi @osallou, thanks for your help. I am commenting here (even though it is closed) because it is relevant to this issue. I tried changing the timeout parameters, but it did not work. When I tried to download using the curl command line, it succeeded.

The issue can be reproduced with the following snippet:

import pycurl

def curl_download(file_to_download, file_path):
    error = True
    nbtry = 1
    while error is True and nbtry < 2:
        fp = open(file_path, "wb")
        curl = pycurl.Curl()
        try:
            curl.setopt(pycurl.URL, file_to_download)
        except Exception:
            curl.setopt(pycurl.URL, file_to_download.encode('ascii', 'ignore'))

        curl.setopt(pycurl.CONNECTTIMEOUT, 300)
        # Overall transfer timeout: 24 hours
        curl.setopt(pycurl.TIMEOUT, 86400)
        # Adding an FTP response timeout also failed
        # (note: it only added that much waiting at the end before reporting the result)
        curl.setopt(pycurl.FTP_RESPONSE_TIMEOUT, 900)
        curl.setopt(pycurl.NOSIGNAL, 1)
        curl.setopt(pycurl.WRITEDATA, fp)
        try:
            curl.perform()
            print("Completed curl.perform")
            errcode = curl.getinfo(pycurl.HTTP_CODE)
            print("Trying to getinfo")
            if int(errcode) != 226 and int(errcode) != 200:
                error = True
                print('Error while downloading ' + file_to_download + ' - ' + str(errcode))
            else:
                error = False
        except Exception as e:
            print('Could not get errcode:' + str(e))

        nbtry += 1
        curl.close()
        fp.close()
    return error

is_failure = curl_download('ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz', 'uniref100.fasta.gz')
if is_failure:
    print("Download failed.")
else:
    print("Download successful.")

The problem, as far as I can tell: the FTP transfer itself completes, but curl.perform still fails afterwards. This might be due to my company's NATs or firewalls; my guess is that the client never receives the end-of-transfer code. I assume this because whenever I increased the timeout, curl waited that long after the download completed (I tried a 2-hour FTP_RESPONSE_TIMEOUT) and still failed at the end. After trying different sets of parameters, I found that it works when setting the TCP_KEEPALIVE options:

        curl.setopt(curl.TCP_KEEPALIVE, True)
        curl.setopt(curl.TCP_KEEPIDLE, 120)
        curl.setopt(curl.TCP_KEEPINTVL, 60)

I added this snippet to biomaj_download/download/ftp.py, and it works for me from the BioMAJ side as well.

My question is: is there a way to enable these KEEPALIVE flags in BioMAJ? If not, and you think this is a valid problem, can I contribute on this front?
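A config-driven keep-alive could be sketched as a small helper that applies the three TCP options only when enabled. This is a minimal illustration of the idea, not BioMAJ's actual API; the helper name and the shape of the options dict are assumptions:

```python
def apply_tcp_keepalive(curl, options):
    """Apply TCP keep-alive settings from a config dict to a pycurl handle.

    `curl` is a pycurl.Curl handle (pycurl exposes option constants as
    attributes on the handle, as in curl.TCP_KEEPALIVE); `options` is a
    dict such as {'tcp_keepalive': 60}. Returns True when keep-alive
    was applied, False when curl defaults were kept.
    """
    interval = options.get('tcp_keepalive')
    if interval is None:
        return False  # option absent: keep curl defaults
    curl.setopt(curl.TCP_KEEPALIVE, True)
    curl.setopt(curl.TCP_KEEPIDLE, int(interval))
    curl.setopt(curl.TCP_KEEPINTVL, int(interval))
    return True
```

Using one interval value for both KEEPIDLE and KEEPINTVL keeps the configuration down to a single number, which matches the single `tcp_keepalive` property discussed below; whether the real implementation splits them is an open design choice.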

osallou commented 5 years ago

Hi, thanks for investigating. I think we could modify biomaj-download to add a keep-alive option based on config (keeping the defaults; if it is set in global or bank properties, then enable keep-alive).

There is currently a refactoring of the download code, so please do not create a PR on this; it may conflict with the rewrite. I will update the code on my side to add this option.

nsanilkumar-valluri commented 5 years ago

@osallou thanks.

osallou commented 5 years ago

The next release will include the patch (via config or env variable, to be defined). In the meantime, you can use a local patch.

osallou commented 4 years ago

@nsanilkumar-valluri , latest updates added support for keep-alive.

You need to update biomaj-core, biomaj-download and biomaj packages. In global.properties, you need to add:

options.names=tcp_keepalive
options.tcp_keepalive=30  # for example, matches TCP_KEEPINTVL
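The mapping from those properties lines to download options could be parsed roughly like this. This is an illustrative sketch based on the two lines above, not biomaj-core's actual parsing code; the function name and the dict-based input are assumptions:

```python
def parse_download_options(properties):
    """Extract options.* entries from already-parsed properties.

    `properties` is a dict of raw string values, e.g. the contents of
    global.properties. Returns a dict mapping each name listed in the
    comma-separated `options.names` to its `options.<name>` value,
    e.g. {'tcp_keepalive': '30'}.
    """
    names = properties.get('options.names', '')
    opts = {}
    for name in (n.strip() for n in names.split(',') if n.strip()):
        value = properties.get('options.' + name)
        if value is not None:
            # drop an inline "# comment" like the example above
            opts[name] = value.split('#', 1)[0].strip()
    return opts
```

For example, feeding it `{'options.names': 'tcp_keepalive', 'options.tcp_keepalive': '30'}` yields `{'tcp_keepalive': '30'}`, which the downloader can then turn into the pycurl keep-alive options.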