genouest / biomaj-download

Download microservice for BioMAJ
GNU Affero General Public License v3.0

Download Bank data into S3 location #4

Open nsanilkumar-valluri opened 5 years ago

nsanilkumar-valluri commented 5 years ago

I want to download bank-related data into an S3 bucket instead of the local file system. When I tried to configure an S3 path in the data.dir variable, it created that path in the current directory and downloaded the data into that folder. Can anyone please help me configure an AWS S3 location for the downloaded data?

osallou commented 5 years ago

You cannot save to S3, only download from S3. If you want to save to S3, you should still save to a local dir and then push the data to S3 via a post-process (and delete the local data, but in that case you will have to download everything again on the next update).
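A minimal sketch of what such a post-process script could look like, assuming the third-party boto3 package is installed and AWS credentials are already configured. The script name, its three arguments and the key layout are hypothetical and not part of BioMAJ; how the script learns the release directory depends on your bank configuration.

```python
import os
import sys


def plan_uploads(local_root, key_prefix):
    """Return sorted (local_path, s3_key) pairs for every file under local_root."""
    uploads = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # S3 keys always use forward slashes, whatever the local OS uses.
            rel = os.path.relpath(path, local_root).replace(os.sep, "/")
            key = "/".join(part for part in (key_prefix.strip("/"), rel) if part)
            uploads.append((path, key))
    return sorted(uploads)


def push_to_s3(local_root, bucket, key_prefix):
    """Upload the planned files; requires the third-party boto3 package."""
    import boto3  # imported lazily so plan_uploads() works without boto3

    client = boto3.client("s3")
    for path, key in plan_uploads(local_root, key_prefix):
        client.upload_file(path, bucket, key)


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical call: push_to_s3.py <release_dir> <bucket> <key_prefix>
    push_to_s3(sys.argv[1], sys.argv[2], sys.argv[3])
```

Keeping the key planning separate from the upload makes it possible to check the mapping on a test directory before anything is sent to a bucket.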

nsanilkumar-valluri commented 5 years ago

Thanks for the reply, @osallou. Is there any option to use some other storage instead of the local file system?

osallou commented 5 years ago

Nope, the goal is to get local files. The only other way (for the moment) is the solution above, i.e. push the data after the update via a post-process.

nsanilkumar-valluri commented 5 years ago

@osallou Thanks for help.

nsanilkumar-valluri commented 5 years ago

@osallou, can you please point me to an example of such a process and bank file? I thought about using the same copy script for every bank, driven by the destination location, but I am not able to get the source location on the fly for each bank.

nsanilkumar-valluri commented 5 years ago

Can I create my own setting, such as s3.remote.location for the destination folder, and pass it as an argument to the process script?

osallou commented 5 years ago

Property files support interpolation, so you can create your own variables and use them elsewhere in properties, like:

myvar=myvalue
myproc.args= %(myvar)s bla bla bla

I have no process example for s3 but you have db and process examples at https://github.com/genouest/biomaj-data?files=1
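To see how this kind of interpolation resolves, here is a small sketch using Python's standard configparser, which implements the same %(name)s syntax. It assumes BioMAJ property files behave the same way; the s3.remote.location key and the bucket name are hypothetical, taken from the question above.

```python
import configparser

# Hypothetical bank properties; s3.remote.location is a user-defined variable.
BANK_PROPERTIES = """
[GENERAL]
s3.remote.location=s3://example-bucket/banks
myproc.args=%(s3.remote.location)s bla bla bla
"""

parser = configparser.ConfigParser()
parser.read_string(BANK_PROPERTIES)

# get() expands %(name)s references to other keys in the same section.
resolved_args = parser.get("GENERAL", "myproc.args")
print(resolved_args)
```

The process script then receives the expanded value as an ordinary argument and does not need to read the property file itself.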

nsanilkumar-valluri commented 5 years ago

It is really helpful. Thanks @osallou

nsanilkumar-valluri commented 5 years ago

Hi @osallou, thanks for your help earlier. I have a use case that requires only some processing, without any download. How can I achieve this in the biomaj configuration? I tried setting the 'protocol' field to none, but that only works if the bank depends on other banks.

osallou commented 5 years ago

Simply use the local protocol with a fake local file to "simulate" a download, and touch this file to update its last-modified date so that a new "workflow" is considered. Or create it in a pre-process.

nsanilkumar-valluri commented 5 years ago

But the local protocol will create a copy of the configured file before starting the post-process, and I don't need two copies of the same file. I have a file called 'samp1.fasta'; if I configure a local copy from it, another copy of samp1.fasta is created, and only then is my post-process script sample.sh triggered. I don't want any other copy of samp1.fasta in my case.

osallou commented 5 years ago

Just create a "fake" file (/opt/fake/triggerbiomaj.txt for example) and use it instead of the real data file.
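A minimal sketch of refreshing such a trigger file from Python, for instance from a scheduled job. The helper name is hypothetical; the only assumption is that the modification time of the file is what marks a new release for the local protocol, as described above.

```python
from pathlib import Path


def touch_trigger(path):
    """Create the trigger file if needed and refresh its modification time."""
    trigger = Path(path)
    trigger.parent.mkdir(parents=True, exist_ok=True)
    # Like the shell `touch`: creates the file, or sets its mtime to now.
    trigger.touch()
    return trigger.stat().st_mtime
```

Calling touch_trigger("/opt/fake/triggerbiomaj.txt") before an update would give the empty file a fresh timestamp, so only that small file is copied rather than the real data.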

nsanilkumar-valluri commented 5 years ago

ok, got it. Thanks

nsanilkumar-valluri commented 5 years ago

@osallou, the FTP download is failing even with your example file swissprot.properties (https://github.com/genouest/biomaj-data/blob/master/biomaj_data/db_properties/swissprot.properties):

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
    self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
    (file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
    rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'

It seems obvious, because the lines list contains:

['<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">', '', '',
 'FTP Listing of /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org', '', '', '',
 'FTP Listing of /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org',
 '', 'Parent Directory', '',
 'Jul 03 2019 15:52         Link LICENSE -> ../../../LICENSE ',
 'Jul 03 2019 14:00         3898 README',
 'Jul 03 2019 14:00         8107 RELEASE.metalink',
 'Jul 03 2019 14:00    Directory docs',
 'Jul 03 2019 14:00          151 reldate.txt',
 'Jul 03 2019 14:00        53536 uniprot.xsd',
 'Jul 03 2019 14:00    576634324 uniprot_sprot.dat.gz',
 'Jul 03 2019 14:00     88666136 uniprot_sprot.fasta.gz',
 'Jul 03 2019 14:00    756218436 uniprot_sprot.xml.gz',
 'Jul 03 2019 14:00      8288117 uniprot_sprot_varsplic.fasta.gz',
 'Jul 03 2019 14:00 102618527180 uniprot_trembl.dat.gz',
 'Jul 03 2019 14:00  37149489903 uniprot_trembl.fasta.gz',
 'Jul 03 2019 14:00 120526763606 uniprot_trembl.xml.gz',
 '', '', '', '', '']

It fails while parsing the very first line. Is there any parameter I am missing that would avoid this error?

osallou commented 5 years ago

Hmm... HTTP is returned, not FTP. It looks like the http protocol is used, though the properties file specifies ftp. I am going to check.

osallou commented 5 years ago

I just did a local test and it worked just fine. Which version of biomaj do you use? Which setup: Docker microservices or monolithic install?

osallou commented 5 years ago

I also made a quick test using https://raw.githubusercontent.com/genouest/biomaj-data/master/biomaj_data/db_properties/swissprot.properties on osallou/biomaj-docker:latest and it worked fine too.

So if you are using this property file, I do not see why the http protocol would be used.

nsanilkumar-valluri commented 5 years ago

I am using the latest version of biomaj. I installed it with:

pip3 install biomaj biomaj-cli biomaj-daemon biomaj-process biomaj-download biomaj-ftp biomaj-release biomaj-user biomaj-zipkin biomaj-core

osallou commented 5 years ago

OK, so you are using the monolithic install. I tried with the Docker setup, but either way it should use the latest pip packages.

I am going to try with the latest code on a monolithic install to see if a protocol issue could occur in this case (though I do not see what the difference could be).

osallou commented 5 years ago

I tested locally and it works fine too.

2019-07-04 13:17:46,091 DEBUG [root][MainThread] Download:List:ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
2019-07-04 13:18:03,897 DEBUG [root][MainThread] Download:File:RegExp:['^swissprot\\.gz$']
2019-07-04 13:18:03,898 DEBUG [root][MainThread] Download:File:MatchRegExp:swissprot.gz
2019-07-04 13:18:03,898 INFO  [root][MainThread] Workflow:wf_download:nb_files_to_download:1
2019-07-04 13:18:03,899 INFO  [root][MainThread] Workflow:wf_download:release:remoterelease:2019-7-2
2019-07-04 13:18:03,899 INFO  [root][MainThread] Workflow:wf_download:release:release:2019-7-2
2019-07-04 13:18:03,909 DEBUG [root][MainThread] Workflow:wf_download:offline_check_dir:/home/osallou/Development/NOSAVE/genouest/biomaj-test/test/data/biomaj/OfflineDir/swissprot_tmp
2019-07-04 13:18:03,909 DEBUG [root][MainThread] Workflow:wf_download:offline_check_file:swissprot.gz
2019-07-04 13:18:03,910 INFO  [root][MainThread] Workflow:wf_download:nb_expected_files:1
2019-07-04 13:18:03,910 INFO  [root][MainThread] Workflow:wf_download:nb_files_in_offline_dir:0
2019-07-04 13:18:03,910 DEBUG [root][MainThread] Workflow:wf_download:create_dir_structure:start
2019-07-04 13:18:03,911 DEBUG [root][MainThread] Workflow:wf_download:create_dir_structure:done
2019-07-04 13:18:04,003 INFO  [root][MainThread] Use remote: False
2019-07-04 13:18:04,004 INFO  [root][MainThread] Workflow:wf_download:DownloadSession:69a2a620-bcd1-4074-87f2-1e273f1cd869
2019-07-04 13:18:04,005 INFO  [root][MainThread] Workflow:wf_download:Download:Waiting
2019-07-04 13:18:04,005 INFO  [root][MainThread] Workflow:wf_download:RemoteDownload:Waiting
2019-07-04 13:18:04,005 INFO  [root][MainThread] Workflow:wf_download:Download:Threads:FillQueue
2019-07-04 13:18:04,006 INFO  [root][MainThread] Workflow:wf_download:Download:Threads:Start
2019-07-04 13:18:04,006 INFO  [root][Thread-5] Start download thread
2019-07-04 13:18:04,007 DEBUG [root][Thread-5] swissprot request to download from ftp://ftp.ncbi.nih.gov
2019-07-04 13:18:04,007 DEBUG [biomaj][Thread-5] Download
2019-07-04 13:18:04,008 DEBUG [root][Thread-5] FTP:Download
2019-07-04 13:18:04,008 DEBUG [root][Thread-5] FTP:Download:Progress:1/1 downloading file swissprot.gz
2019-07-04 13:18:04,008 DEBUG [root][Thread-5] FTP:Download:Progress:1/1 save as swissprot.gz

We can see in logs

2019-07-04 13:18:04,007 DEBUG [root][Thread-5] swissprot request to download from ftp://ftp.ncbi.nih.gov

ftp is correctly used

In your global.properties, set (or change)

historic.logfile.level=DEBUG

and set all logger/handler log level to DEBUG

Then try to run your update and please send the resulting logs

nsanilkumar-valluri commented 5 years ago

In my case too, I don't think it goes to the HTTP implementation: as you can see, the error trace points to ftp.py (the list() function). All I can see is that HTML tag lines ended up in the list along with the actual files and folders. The file lines would parse, but parsing fails as soon as it reaches an HTML line. I hope this helps.
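The failure can be reproduced outside biomaj. The sketch below assumes, based only on the traceback, that the listing parser splits each line on whitespace and reads field 4 as the size; the helper name and the FTP-style sample line are made up for illustration and are not biomaj code.

```python
def parse_size(line):
    """Return the size field of an `ls -l` style listing line, or None."""
    parts = line.split()
    try:
        return int(parts[4])
    except (IndexError, ValueError):
        # Not a listing line (for example HTML emitted by a proxy).
        return None


# Made-up FTP-style line next to the HTML line from the error above.
ftp_line = "-rw-r--r--   1 ftp  anonymous 106473461 Jul  2 04:00 swissprot.gz"
html_line = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">'

print(html_line.split()[4])  # the token quoted in the ValueError
print(parse_size(ftp_line), parse_size(html_line))
```

Skipping unparseable lines like this would avoid the crash, but it would not by itself make the HTML file rows readable, since their fields sit in different positions.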

nsanilkumar-valluri commented 5 years ago

Sure, I will send the debug messages.

nsanilkumar-valluri commented 5 years ago

2019-07-04 11:28:02,143 INFO [root][MainThread] Workflow:Skip:depends
2019-07-04 11:28:02,143 INFO [root][MainThread] Workflow:Skip:preprocess
2019-07-04 11:28:02,143 INFO [root][MainThread] Workflow:Skip:release
2019-07-04 11:28:02,144 INFO [root][MainThread] Workflow:Start:download
2019-07-04 11:28:02,144 INFO [root][MainThread] Workflow:wf_download
2019-07-04 11:28:02,144 INFO [root][MainThread] Use remote: False
2019-07-04 11:28:02,144 INFO [root][MainThread] Workflow:wf_download:DownloadSession:500dce93-a43e-48be-b876-870c0f70f523
2019-07-04 11:28:02,144 DEBUG [biomaj][MainThread] Download
2019-07-04 11:28:02,145 INFO [root][MainThread] Workflow:DownloadService:CleanSession
2019-07-04 11:28:02,145 DEBUG [root][MainThread] Download:List:ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
2019-07-04 11:28:03,135 ERROR [root][MainThread] Workflow:download:Exception:invalid literal for int() with base 10: 'HTML//EN">'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
    self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
    (file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
    rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'
2019-07-04 11:28:03,137 DEBUG [root][MainThread] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
    self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
    (file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
    rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'

2019-07-04 11:28:03,138 ERROR [root][MainThread] Error during task download
2019-07-04 11:28:03,138 INFO [root][MainThread] Workflow:wf_over
2019-07-04 11:28:03,175 INFO [root][MainThread] Notify:none
An error occured:

osallou commented 5 years ago

Hmm, strange. The listing is an HTTP listing, not an FTP listing; the 'HTML//EN">' shows it is HTML. And I do not experience the problem either on my computer (latest code) or on our prod server (slightly older code). Could it be a curl/pycurl issue? Which version of pycurl/curl are you using? Which OS?

nsanilkumar-valluri commented 5 years ago

[root@3cd1c9c09f59 /]# curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.36 zlib/1.2.7 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz unix-sockets

nsanilkumar-valluri commented 5 years ago

Package python-pycurl-7.19.0-19.el7.x86_64

osallou commented 5 years ago

Are you using Python 2 or 3?

osallou commented 5 years ago

Python 3, since the command you sent uses pip3 ... :-)

osallou commented 5 years ago

I did a test in a fresh Docker container and installed biomaj with pip3. It worked fine... :-(

In virtualenv I created to install biomaj packages

so we have same libraries, same install, and I cannot reproduce in any environment using https://raw.githubusercontent.com/genouest/biomaj-data/master/biomaj_data/db_properties/swissprot.properties

Are you sure there is no problem in your global or bank properties file?

Can you provide your global.properties?

osallou commented 5 years ago

Looking back at the issue, I just saw this in your result:

"... /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org"

This is not the swissprot example... it is a request to the uniprot server. Which config file are you using?

Or we are not redirected to the same web site...

From the machine where biomaj runs, what is the result of:

curl -v https://ftp.ncbi.nih.gov/blast/db/FASTA/

osallou commented 5 years ago

I can easily see how to fix this parsing issue; I just wonder why we do not get the same results, and why there are uniprot references...

nsanilkumar-valluri commented 5 years ago

@osallou, sorry for the late reply, and thanks for your help. Regarding the URL: yes, the first debug statements came from a different bank. I later used the same swissprot.properties to confirm the issue. Sorry for the confusion, but I can assure you that this problem also occurs with the swissprot dataset. For the curl command, the following is the output we are getting:

[root@fa19786b45e5 /]# curl -v https://ftp.ncbi.nih.gov/blast/db/FASTA/

nsanilkumar-valluri commented 5 years ago

global.properties file has

[GENERAL]
root.dir=/********/biomaj_data
conf.dir=%(root.dir)s/conf
log.dir=/***/log
process.dir=%(root.dir)s/process
cache.dir=%(root.dir)s/cache
lock.dir=%(root.dir)s/lock
#The root directory where all databases are stored.
#If your data is not stored under one directory hirearchy
#you can override this value in the database properties file.
data.dir=/***/data

db.url=mongodb://127.0.0.1:27017
db.name=biomaj

use_ldap=0
ldap.host=localhost
ldap.port=389
ldap.dn=nodomain

use_elastic=0
#Comma separated list of elasticsearch nodes  host1,host2:port2
elastic_nodes=elasticsearch
elastic_index=biomaj
# Calculate data.dir size stats
data.stats=1

celery.queue=biomaj
celery.broker=mongodb://127.0.0.1:27017/biomaj_celery

auto_publish=1

########################
# Global properties file
#To override these settings for a specific database go to its
#properties file and uncomment or add the specific line you want
#to override.
#----------------
# Mail Configuration
#---------------
#Uncomment thes lines if you want receive mail when the workflow is finished

mail.smtp.host=
#mail.stmp.host=
mail.admin=
mail.from=biomaj@localhost
mail.user=
mail.password=
mail.tls=

#---------------------
#Proxy authentification
#---------------------
#proxyHost=
#proxyPort=
#proxyUser=
#proxyPassword=

#---------------------
# PROTOCOL
#-------------------
#possible values : ftp, http, rsync, local
port=21
username=anonymous
password=anonymous@nowhere.com

#access user for production directories
production.directory.chmod=775
#Number of thread during the download
bank.num.threads=4

#Number of threads to use for downloading and processing
files.num.threads=4

#to keep more than one release increase this value
keep.old.version=0

#Link copy property
do.link.copy=true

#The historic log file is generated in log/
#define level information for output : DEBUG,INFO,WARN,ERR
historic.logfile.level=DEBUG

http.parse.dir.line=<a[\\s]+href="([\\S]+)\\/"[\\s]*>.*([\\d]{4}-[\\w\\d]{2,5}-[\\d]{2}\\s[\\d]{2}:[\\d]{2})
http.parse.file.line=<a[\\s]+href="([\\S]+)"[\\s]*>.*([\\d]{4}-[\\w\\d]{2,5}-[\\d]{2}\\s[\\d]{2}:[\\d]{2}).*([\d\.]+[MKG]{0,1})

http.group.dir.name=1
http.group.dir.date=2
http.group.file.name=1
http.group.file.date=2
http.group.file.size=3

#Needed if data sources are contains in an archive
log.files=true

local.files.excluded=\\.panfs.*

#~40mn
ftp.timeout=2000000
ftp.automatic.reconnect=5
ftp.active.mode=false

# Bank default access
visibility.default=public

#proxy=http://localhost:3128

[loggers]
keys = root, biomaj

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = DEBUG
handlers = console

[logger_biomaj]
level = DEBUG
handlers = console
qualname = biomaj
propagate=0

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = DEBUG
formatter = generic

[formatter_generic]
format = %(asctime)s %(levelname)-5.5s [%(name)s][%(threadName)s] %(message)s

nsanilkumar-valluri commented 5 years ago

Debug report for swissprot dataset

2019-07-04 14:10:09,799 DEBUG [biomaj][MainThread] Download
2019-07-04 14:10:09,800 INFO  [root][MainThread] Workflow:DownloadService:CleanSession
2019-07-04 14:10:09,800 DEBUG [root][MainThread] Download:List:ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
2019-07-04 14:10:10,493 ERROR [root][MainThread] Workflow:download:Exception:invalid literal for int() with base 10: 'HTML//EN">'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
(file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'
2019-07-04 14:10:10,494 DEBUG [root][MainThread] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
(file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'

2019-07-04 14:10:10,495 ERROR [root][MainThread] Error during task download
2019-07-04 14:10:10,495 INFO  [root][MainThread] Workflow:wf_over
2019-07-04 14:10:10,532 INFO  [root][MainThread] Notify:none
An error occured:

Bank update request sent for swissprot
Failed to send update request for swissprot

osallou commented 5 years ago

I am going to check with your global.properties.

And what is the result of curl over FTP:

curl -v ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

Please indent your results or attach them as files; they are hard to read...

osallou commented 5 years ago

I got no issue with your global.properties... You still get an HTTP answer to an FTP request... Are you behind a proxy?

I think I remember someone in a company who had this kind of issue. The requests were going out through a proxy that did not handle FTP directly; it turned the FTP requests into HTTP requests/connections, leading to different answers...
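One quick way to check for this is to look at the proxy environment variables on the machine where biomaj runs. The sketch below lists the variable names that curl documents for ftp:// URLs; the helper name is hypothetical, and it is an assumption that the pycurl-based download in biomaj picks up the same variables through libcurl.

```python
import os

# Variable names curl documents for ftp:// URLs (protocol-specific, then global).
FTP_PROXY_VARS = ("ftp_proxy", "FTP_PROXY", "all_proxy", "ALL_PROXY")


def find_ftp_proxy(environ=None):
    """Return (variable_name, value) for the first proxy setting found, or None."""
    env = os.environ if environ is None else environ
    for name in FTP_PROXY_VARS:
        value = env.get(name)
        if value:
            return name, value
    return None


print(find_ftp_proxy())
```

If this reports an http:// proxy, FTP listings are probably being fetched through that proxy and returned as HTML, which is the situation described above.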

nsanilkumar-valluri commented 5 years ago

@osallou, thanks for your help. Sorry, next time I will take care of the indentation. Yes, I am behind my company proxy. Do you remember any resolution for that problem?

osallou commented 5 years ago

So I think the proxy is the issue. Can you try the curl FTP command to see what is returned?

curl -v ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

If it is the issue, and I think it is, then you cannot use FTP from your company network (unless you ask the IT team for true direct FTP access to the internet from your server). The workaround is to use HTTP, as most of the web sites for banks offer both FTP and HTTP access. However, as HTTP listings are not standardized, you may have to customize, in your bank property file, the http regexp properties set in global.properties.

The regexps are used to analyse the web listing page and extract file and dir info.

You can, however, try the default ones first and see if they match.

nsanilkumar-valluri commented 5 years ago

* About to connect() to proxy **************.com port 8080 (#0)
*   Trying ****************
* Connected to ***********************.com (10.127.189.154) port 8080 (#0)
> GET ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ HTTP/1.1
> User-Agent: curl/7.29.0
> Host: ftp.ncbi.nih.gov:21
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 OK
< Content-Type: text/html
< Transfer-Encoding: chunked
< Proxy-Connection: Keep-Alive
< Connection: Keep-Alive
< Date: Fri, 05 Jul 2019 05:41:44 GMT
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>FTP Listing of /blast/db/FASTA/ at ftp.ncbi.nih.gov</TITLE>
<BASE HREF="ftp://ftp.ncbi.nih.gov/blast/db/FASTA/">
</HEAD>
<BODY>
<H2>FTP Listing of /blast/db/FASTA/ at ftp.ncbi.nih.gov</H2>
<HR>
<A HREF="../">Parent Directory</A><BR>
<PRE>
Nov 26 2003 00:00        91553 <A HREF="alu.a.gz">alu.a.gz</A>
Jun 15 2009 00:00           43 <A HREF="alu.a.gz.md5">alu.a.gz.md5</A>
Nov 26 2003 00:00        24465 <A HREF="alu.n.gz">alu.n.gz</A>
Jun 15 2009 00:00           43 <A HREF="alu.n.gz.md5">alu.n.gz.md5</A>
Nov 26 2003 00:00      4283092 <A HREF="drosoph.aa.gz">drosoph.aa.gz</A>
Jun 15 2009 00:00           48 <A HREF="drosoph.aa.gz.md5">drosoph.aa.gz.md5</A>
Nov 26 2003 00:00     36924008 <A HREF="drosoph.nt.gz">drosoph.nt.gz</A>
Jun 15 2009 00:00           48 <A HREF="drosoph.nt.gz.md5">drosoph.nt.gz.md5</A>
Jun 23 2019 00:04    967779446 <A HREF="env_nr.gz">env_nr.gz</A>
Jun 23 2019 00:04           44 <A HREF="env_nr.gz.md5">env_nr.gz.md5</A>
Jun 23 2019 11:06  43086728486 <A HREF="env_nt.gz">env_nt.gz</A>
Jun 23 2019 11:25           44 <A HREF="env_nt.gz.md5">env_nt.gz.md5</A>
Mar 17 2019 17:21   1458715296 <A HREF="est_human.gz">est_human.gz</A>
Mar 17 2019 17:21           47 <A HREF="est_human.gz.md5">est_human.gz.md5</A>
Mar 17 2019 18:10    776046470 <A HREF="est_mouse.gz">est_mouse.gz</A>
Mar 17 2019 18:10           47 <A HREF="est_mouse.gz.md5">est_mouse.gz.md5</A>
Jun 23 2019 21:46  11779604082 <A HREF="est_others.gz">est_others.gz</A>
Jun 23 2019 21:51           48 <A HREF="est_others.gz.md5">est_others.gz.md5</A>
Feb 24 2019 12:22   9999571934 <A HREF="gss.gz">gss.gz</A>
Feb 24 2019 12:27           41 <A HREF="gss.gz.md5">gss.gz.md5</A>
Jun 23 2019 09:45   8044464017 <A HREF="htgs.gz">htgs.gz</A>
Jun 23 2019 09:49           42 <A HREF="htgs.gz.md5">htgs.gz.md5</A>
Feb 01 2013 00:00     33709040 <A HREF="igSeqNt.gz">igSeqNt.gz</A>
Feb 01 2013 00:00      4654020 <A HREF="igSeqProt.gz">igSeqProt.gz</A>
Jul 05 2019 03:50     15862667 <A HREF="mito.aa.gz">mito.aa.gz</A>
Jul 05 2019 03:50           45 <A HREF="mito.aa.gz.md5">mito.aa.gz.md5</A>
Jul 05 2019 03:51     73957465 <A HREF="mito.nt.gz">mito.nt.gz</A>
Jul 05 2019 03:51           45 <A HREF="mito.nt.gz.md5">mito.nt.gz.md5</A>
Jul 02 2019 08:01  50774575876 <A HREF="nr.gz">nr.gz</A>
Jul 02 2019 08:20           40 <A HREF="nr.gz.md5">nr.gz.md5</A>
Jun 23 2019 13:27  57513447669 <A HREF="nt.gz">nt.gz</A>
Jun 23 2019 13:52           40 <A HREF="nt.gz.md5">nt.gz.md5</A>
Jun 29 2019 21:25 299422788794 <A HREF="other_genomic.gz">other_genomic.gz</A>
Jun 29 2019 23:31           51 <A HREF="other_genomic.gz.md5">other_genomic.gz.md5</A>
Jun 23 2019 10:01    288307625 <A HREF="pataa.gz">pataa.gz</A>
Jun 23 2019 10:02           43 <A HREF="pataa.gz.md5">pataa.gz.md5</A>
Jun 23 2019 12:27   6000355688 <A HREF="patnt.gz">patnt.gz</A>
Jun 23 2019 12:30           43 <A HREF="patnt.gz.md5">patnt.gz.md5</A>
Jul 02 2019 04:00     21028400 <A HREF="pdbaa.gz">pdbaa.gz</A>
Jul 02 2019 04:00           43 <A HREF="pdbaa.gz.md5">pdbaa.gz.md5</A>
Jul 02 2019 01:00       679928 <A HREF="pdbnt.gz">pdbnt.gz</A>
Jul 02 2019 01:00           43 <A HREF="pdbnt.gz.md5">pdbnt.gz.md5</A>
May 19 2019 06:02    195858975 <A HREF="sts.gz">sts.gz</A>
May 19 2019 06:03           41 <A HREF="sts.gz.md5">sts.gz.md5</A>
Jul 02 2019 04:00    106473461 <A HREF="swissprot.gz">swissprot.gz</A>
Jul 02 2019 04:00           47 <A HREF="swissprot.gz.md5">swissprot.gz.md5</A>
Jan 13 2010 00:00       881144 <A HREF="vector.gz">vector.gz</A>
Nov 26 2003 00:00      1951194 <A HREF="yeast.aa.gz">yeast.aa.gz</A>
Jun 15 2009 00:00           46 <A HREF="yeast.aa.gz.md5">yeast.aa.gz.md5</A>
Nov 26 2003 00:00      3732371 <A HREF="yeast.nt.gz">yeast.nt.gz</A>
Jun 15 2009 00:00           46 <A HREF="yeast.nt.gz.md5">yeast.nt.gz.md5</A>
</PRE>
<HR>
</BODY>
</HTML>

nsanilkumar-valluri commented 5 years ago

Yes. Can we contribute something to handle this type of case? Company proxies (policies) are unlikely to be changed for a single project.

osallou commented 5 years ago

The problem is that proxies that convert FTP to HTTP (which is not always supported or allowed) follow no standard. The returned HTML will look different depending on the proxy used, which prevents biomaj from handling this use case correctly.

As I said, the workaround is to use the http protocol instead of ftp in those cases. The http regexp parser is not always easy to set up, but usually you only need to define a few patterns. The ones provided in global.properties match some servers, not all, since HTTP file listings are not standardized. If they do not match, you need to find the correct regexps and set them in your bank property file.
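A listing regexp can be prototyped in plain Python before it goes into a bank property file. The sketch below matches the proxy-generated rows shown in the curl output earlier in this thread; the pattern and helper are illustrative only. A real HTTP listing from the bank server will look different, and whether biomaj accepts this date format has not been verified here.

```python
import re

# Group 1: date, group 2: size in bytes, group 3: file name.
# In a bank property file the group numbers would go into the
# http.group.file.* settings shown in global.properties.
LISTING_LINE = re.compile(
    r'^(\w{3} \d{2} \d{4} \d{2}:\d{2})\s+(\d+)\s+<A HREF="([^"]+)">'
)


def parse_listing(html):
    """Return (name, date, size) for every line that looks like a file row."""
    files = []
    for line in html.splitlines():
        match = LISTING_LINE.match(line)
        if match:
            date, size, name = match.groups()
            files.append((name, date, int(size)))
    return files


SAMPLE = """<PRE>
Jul 02 2019 04:00    106473461 <A HREF="swissprot.gz">swissprot.gz</A>
Jul 02 2019 04:00           47 <A HREF="swissprot.gz.md5">swissprot.gz.md5</A>
</PRE>"""

print(parse_listing(SAMPLE))
```

Lines that do not match, such as the HTML header and footer, are ignored instead of raising an error, which is how a listing parser needs to behave on this kind of page.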

If you find a way to handle most cases, we'll be glad to get it in biomaj :-)