genouest / biomaj-download

Download microservice for BioMAJ
GNU Affero General Public License v3.0

Issue with biomaj cache #39

Closed braffes closed 3 years ago

braffes commented 3 years ago

I have an issue when biomaj tries to use its cache. On some banks it always works, and on other banks it never works.

2020-12-22 20:05:06,151 INFO  [root][MainThread] Workflow:DownloadService:CleanSession
2020-12-22 20:05:07,372 INFO  [root][MainThread] Workflow:wf_download:nb_files_to_download:3
2020-12-22 20:05:07,373 INFO  [root][MainThread] Workflow:wf_download:release:remoterelease:2018-8-24
2020-12-22 20:05:07,373 INFO  [root][MainThread] Workflow:wf_download:release:release:2018-8-24
2020-12-22 20:05:07,406 INFO  [root][MainThread] Workflow:wf_download:nb_expected_files:3
2020-12-22 20:05:07,406 INFO  [root][MainThread] Workflow:wf_download:nb_files_in_offline_dir:0
2020-12-22 20:05:07,407 INFO  [root][MainThread] Workflow:wf_download:Cache:using /opt/biomaj/cache/local_files_1608663732.1273413                                                                   
2020-12-22 20:05:07,408 ERROR [root][MainThread] Workflow:download:Exception:'dict' object has no attribute 'endswith'                                                                                     
Traceback (most recent call last):
  File "/opt/biomaj/lib64/python3.6/site-packages/biomaj/workflow.py", line 131, in start
    self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/opt/biomaj/lib64/python3.6/site-packages/biomaj/workflow.py", line 1476, in wf_download
    downloader.download_or_copy(last_production_files, last_production_dir)
  File "/opt/biomaj/lib64/python3.6/site-packages/biomaj_download/download/interface.py", line 487, in download_or_copy                                                                                    
    self.set_files_to_download(new_files_to_download)
  File "/opt/biomaj/lib64/python3.6/site-packages/biomaj_download/download/direct.py", line 69, in set_files_to_download                                                                                   
    return super(DirectFTPDownload, self).set_files_to_download(files_to_download)
  File "/opt/biomaj/lib64/python3.6/site-packages/biomaj_download/download/interface.py", line 343, in set_files_to_download                                                                               
    self._append_file_to_download(file_to_download)
  File "/opt/biomaj/lib64/python3.6/site-packages/biomaj_download/download/direct.py", line 54, in _append_file_to_download                                                                                
    if filename.endswith('/'):
AttributeError: 'dict' object has no attribute 'endswith'
2020-12-22 20:05:07,410 ERROR [root][MainThread] Error during task download
2020-12-22 20:05:07,410 INFO  [root][MainThread] Workflow:wf_over

The content of the file /opt/biomaj/cache/local_files_1608663732.1273413:

[{"root": "/", "permissions": "", "group": "", "user": "", "size": 23723408, "month": 6, "day": 19, "year": 2018, "name": "/common/downloads/release-38/Pfalciparum3D7/fasta/data/PlasmoDB-38_Pfalciparum3D7_Genome.fasta", "hash": "e58669a71eacff7a9dcceed04a8ecdd1", "save_as": "PlasmoDB-38_Pfalciparum3D7_Genome.fasta", "url": "https://plasmodb.org"}]

Do you get this error sometimes? When a bank is stuck in this state, the only way to update it is to use the --from-scratch option.
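To make the failure concrete, here is a minimal sketch of what seems to be happening, assuming the entries read from the cache file are passed on unchanged. The function `append_file_to_download` is a hypothetical stand-in for the check shown in the traceback (direct.py, line 54), not the real biomaj-download code, and the cache entry is trimmed to a few of its fields.

```python
import json

# One entry shaped like the cache file content quoted above (fields trimmed).
cache_content = (
    '[{"root": "/", '
    '"name": "/common/downloads/release-38/Pfalciparum3D7/fasta/data/'
    'PlasmoDB-38_Pfalciparum3D7_Genome.fasta", '
    '"save_as": "PlasmoDB-38_Pfalciparum3D7_Genome.fasta", '
    '"url": "https://plasmodb.org"}]'
)


def append_file_to_download(filename):
    """Hypothetical stand-in for the string check at direct.py line 54."""
    if filename.endswith('/'):
        raise ValueError('expected a file, got a directory')
    return filename


def reproduce():
    """Pass a cached entry (a dict) where a filename (a str) is expected."""
    entries = json.loads(cache_content)  # a list of dicts, not of strings
    try:
        append_file_to_download(entries[0])
    except AttributeError as exc:
        return str(exc)
    return None


print(reproduce())  # 'dict' object has no attribute 'endswith'
```

The message matches the one in the log, which suggests the direct downloader receives a dict from the cache where it expects a plain filename.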

Here is the process to get the error:

1. Take a bank that uses the directhttps protocol.
2. Update this bank.
3. Change the remote file, for example use release 37 instead of release 38 for the file PlasmoDB-38_Pfalciparum3D7_Genome.fasta.
4. Change the remote release setting (release file in use --> no release file in use).

And you will get the error above.

I noticed that it works with banks using the ftp protocol but not with banks using the directhttps or directhttp protocols. This may be a coincidence, though: as the excerpt from direct.py below shows, the cache seems to work only when there is a single file to download. I download many files with ftp, which is not the case with directhttp.

    def set_files_to_download(self, files_to_download):
        if len(files_to_download) > 1:
            self.files_to_download = []
            msg = self.__class__.__name__ + ' accepts only 1 file'
            self.logger.error(msg)
            raise ValueError(msg)
        return super(DirectFTPDownload, self).set_files_to_download(files_to_download)
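One possible way to guard against this, sketched under assumptions: normalize each entry to a filename before running string checks on it. The helper `file_name_of` is hypothetical and is not part of biomaj-download; the only thing taken from the cache file above is that a cached entry is a dict with a "name" key. This is an illustration, not necessarily how the maintainers would fix it.

```python
def file_name_of(entry):
    """Return the remote path for a plain string or a cache-style dict.

    Hypothetical helper: accepts either form so that a later
    `.endswith('/')` check always runs on a str.
    """
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict) and 'name' in entry:
        return entry['name']
    raise TypeError('unsupported file entry: %r' % (entry,))


# A plain filename passes through unchanged.
print(file_name_of('/common/downloads/file.fasta'))

# A cache-style dict yields its "name" field.
cached = {'name': '/common/downloads/file.fasta', 'save_as': 'file.fasta'}
print(file_name_of(cached).endswith('/'))
```

Raising a TypeError for anything else keeps unexpected entry shapes visible instead of letting them fail later with a less obvious AttributeError.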

I hope you have enough information to debug this issue.

Thanks for your attention,

Brice

osallou commented 3 years ago

Thanks for all this debug info. There is certainly an issue with how the direct protocols handle the cache. I have not hit this issue myself, but I do not really use directhttp in production, only in tests, and the cache is not used in tests.

Will check that.

osallou commented 3 years ago

Will be fixed in biomaj-download 3.2.4.