TBroTeam / TBro

Visualization and management of denovo transcriptomes
https://tbroteam.github.io/TBro/

Setting up blast databases in TBro in Docker in Amazon AWS Lightsail #49

Open 000generic opened 7 years ago

000generic commented 7 years ago

Hi! Thanks for your help on the last two issues :)

I'm now having trouble running phing to set up the blast databases.

From the documentation, I'm not sure if I should be running phing at the TBro command line or at the Ubuntu command line. At the TBro command line, I can neither run nor install phing. At the Ubuntu command line, phing can be installed but does not run successfully.

Also, I don't know how to locate TBro directories when I am not at the TBro command line. Is it possible to enter TBro directories when I am at the Ubuntu command line?

I'm also unsure what you mean by "main TBro directory" in the documentation. Is this the default directory when I start up TBro? Or the directory I created to store my data in?

Details follow:

In TBro:

First I should move to my "main TBro directory" - I am guessing this is the directory I created to store all my data in when I set up TBro...?

cd /squid

I then follow TBro documentation instructions but the phing command is not found when run at the TBro command line:

root@9b953c8ae04e:/# phing queue-install-db
bash: phing: command not found

When I try to install phing I get the following error:

sudo apt-get update
sudo apt-get install phing

Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package phing

so I'm not sure how to install phing at the TBro command line.

Outside TBro

I am able to install phing at the Ubuntu command line but I don't know how to locate my main TBro directory inside of Docker - is it possible to enter TBro directories from outside Docker/TBro? When I run the required phing command in different folders I get the following error:

ubuntu@ip-172-26-13-108:/$ phing queue-install-db
Buildfile: build.xml does not exist!

I'm not sure what this error refers to, but maybe it's related to being in the wrong directory?

So I'm not sure 1) what directory I should be running phing in, 2) whether I should run phing at the Ubuntu or at the TBro command line, and 3) if I should run phing in TBro, how to install it there.

Any suggestions would be greatly appreciated.

Thank-you

phryneas commented 7 years ago

you should run phing from inside the TBro main container - if you installed everything according to the documentation, you can enter that container using docker exec -it TBro_official /bin/bash and that container should already contain an installation of phing.

if you are still missing it, you should be able to install it using composer global require phing/phing

inside that container, running phing database-initialize should be possible from /home/tbro

in any case, the folder to run phing from is the folder where the build.xml is located in.
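As a rough illustration of that last point, the sketch below walks upward from a starting directory until it finds a build.xml. The function name find_build_dir is made up for this example; it is not part of TBro or phing.

```shell
# find_build_dir [START_DIR]
# Hypothetical helper: print the nearest directory at or above START_DIR
# (default: current directory) that contains a build.xml.
find_build_dir() {
  # Resolve to an absolute path so the upward walk always ends at "/".
  dir=$(CDPATH= cd -- "${1:-.}" && pwd) || return 1
  while :; do
    if [ -f "$dir/build.xml" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Stop once the filesystem root has been checked.
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
}
```

Inside the container, something like `cd "$(find_build_dir /home/tbro/src)" && phing queue-install-db` would then run phing from /home/tbro, assuming the directory layout shown later in this thread.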

000generic commented 7 years ago

I tried to follow your installation directions closely when I installed TBro, and I have reinstalled repeatedly, but each time I get stuck at the phing command.

Phing appears to be installed but a 'command not found' error is given when I try to run phing in the TBro directory that has the build.xml file - or in any directory in TBro.

Following your directions above:

ubuntu@ip-172-26-13-108:~$ docker exec -it TBro_official /bin/bash

root@9b953c8ae04e:/# phing queue-install-db
bash: phing: command not found

root@9b953c8ae04e:/# composer global require phing/phing
Changed current directory to /root/.composer
Running composer as root/super user is highly discouraged as packages, plugins and scripts cannot always be trusted
Using version ^2.16 for phing/phing
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Nothing to install or update
Generating autoload files

root@9b953c8ae04e:/# cd home/tbro/
root@9b953c8ae04e:/home/tbro# phing database-initialize
bash: phing: command not found

root@9b953c8ae04e:/home/tbro# ls
INSTALLATION build.xml doc src README.md build_installation.sh enable_AllowOverride_Apache2.sed test build.properties composer.json phpunit.xml update_config.sed build.properties.example composer.lock queue_config.example.sql update_installation.sh

root@9b953c8ae04e:/home/tbro# phing queue-install-db
bash: phing: command not found

...I just realized the file phing will generate is already in the directory: queue_config.example.sql. I'm not sure how it was generated, as I still haven't gotten phing to work, but I'll try working with it.

...it looks like queue_config.example.sql was generated when things were built by docker exec -i -t TBro_official /home/tbro/build_installation.sh

screen output:

Buildfile: /home/tbro/build.xml
 [property] Loading /home/tbro/./build.properties

tbro > queue-install-db:

 [copy] Copying 1 file to /home/tbro
 [echo] an example configuration has been copied to /home/tbro/queue_config.example.sql!
 [echo] modify it to your needs and load it into your blast database

BUILD FINISHED

Total time: 0.4795 seconds

000generic commented 7 years ago

Two new questions:

1) To move zipped blast database files into my docker container using

curl --data-binary --ftp-pasv --user "$WORKERFTP_FTP_USER":"$WORKERFTP_FTP_PW" -T cannabis_sativa_transcriptome.zip ftp://$WORKERFTP_IP/

how can I determine what the values of the three variables

$WORKERFTP_FTP_USER $WORKERFTP_FTP_PW $WORKERFTP_IP

are for my Docker container?

When I run set in TBro, nothing shows up for the three variables:

root@dddabdc84640:/# set | grep WORKERFTP
WORKERFTP_ENV_FTP_PW=ftp
WORKERFTP_ENV_FTP_USER=tbro
WORKERFTP_NAME=/TBro_official/WORKERFTP
WORKERFTP_PORT=tcp://172.17.0.4:21
WORKERFTP_PORT_21_TCP=tcp://172.17.0.4:21
WORKERFTP_PORT_21_TCP_ADDR=172.17.0.4
WORKERFTP_PORT_21_TCP_PORT=21
WORKERFTP_PORT_21_TCP_PROTO=tcp

so I'm not sure where to find values for them.

I tried

$WORKERFTP_ENV_FTP_USER $WORKERFTP_ENV_FTP_PW $WORKERFTP_PORT_21_TCP_ADDR

in the curl command but it didn't seem to work:

curl --data-binary --ftp-pasv --user “tbro”:”ftp” -T blastdb-Harvard-AA.zip ftp://172.17.0.4/

curl: (67) Access denied: 530 (when run from Ubuntu)
curl: (6) Could not resolve host: tbro (when run from TBro)

2) How do I "run the queue_config.sql commands in your queue database." ?

Thank-you!

phryneas commented 7 years ago

I think I'll ping @greatfireball or @iimog on this, this is getting too specialized with the setup for me now, as they created the docker containers.

iimog commented 7 years ago

Hi @000generic, sorry for the confusion. I think the documentation needs some serious improvements. First of all, "the main TBro directory" is indeed /home/tbro/, i.e. the directory containing the source code (I will clarify that in the docs). phing is installed via composer, so it is available in ~/.composer/vendor/bin. This directory is added to the path via ~/.bash_profile, which is apparently not loaded when entering the container. You can fix that by entering either

source ~/.bash_profile

or

export PATH=~/.composer/vendor/bin:$PATH

But anyway, you are right: phing queue-install-db is already executed when following the installation instructions (by build_installation.sh).
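The PATH fix described above could be sketched as follows. This is only an illustration: it assumes phing was installed with composer global require phing/phing and therefore lives in ~/.composer/vendor/bin, as stated in this comment; the variable name composer_bin is made up.

```shell
# Assumption: composer's global bin directory is ~/.composer/vendor/bin.
composer_bin="$HOME/.composer/vendor/bin"

# Prepend the directory only if it is not already on PATH.
case ":$PATH:" in
  *":$composer_bin:"*) ;;
  *) PATH="$composer_bin:$PATH"; export PATH ;;
esac

# Report whether phing can now be resolved.
if command -v phing >/dev/null 2>&1; then
  echo "phing found at $(command -v phing)"
else
  echo "phing still not on PATH; check $composer_bin" >&2
fi
```

The change only lasts for the current shell session, so it would have to be repeated after each docker exec unless it is added to a startup file that the shell actually reads.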

You are also right regarding the environment variables (I will update them in the docs). However, the curl command should work from TBro. Can you please try again this one:

curl --data-binary --ftp-pasv --user $WORKERFTP_ENV_FTP_USER:$WORKERFTP_ENV_FTP_PW -T blastdb-Harvard-AA.zip ftp://"$WORKERFTP_PORT_21_TCP_ADDR"/
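A slightly more defensive variant of this upload is sketched below. It checks that the Docker link variables seen in the set | grep WORKERFTP output are present before calling curl. The function names require_vars and upload_blastdb_zip are hypothetical. The sketch also leaves out --data-binary, on the assumption that it is an HTTP POST option and not needed for an FTP upload with -T.

```shell
# require_vars NAME...
# Succeed only if every named variable is set and non-empty.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# upload_blastdb_zip ZIPFILE
# Hypothetical wrapper: upload one zip to the worker FTP container.
upload_blastdb_zip() {
  require_vars WORKERFTP_ENV_FTP_USER WORKERFTP_ENV_FTP_PW \
               WORKERFTP_PORT_21_TCP_ADDR || return 1
  curl --ftp-pasv \
       --user "$WORKERFTP_ENV_FTP_USER:$WORKERFTP_ENV_FTP_PW" \
       -T "$1" "ftp://$WORKERFTP_PORT_21_TCP_ADDR/"
}
```

Run from inside the TBro container, upload_blastdb_zip blastdb-Harvard-AA.zip would then fail early with a readable message if one of the variables is not defined in that shell.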

To import the content of the queue_config.sql file into the queue database execute (from TBro):

PGPASSWORD=$WORKER_ENV_DB_PW psql -U $WORKER_ENV_DB_USER -h $WORKER_PORT_5432_TCP_ADDR -p $WORKER_PORT_5432_TCP_PORT <queue_config.sql

000generic commented 7 years ago

Getting closer....

I was able to run both the curl and PGPASSWORD commands successfully now - but nothing is showing up in TBro as a blast database to blast against. Specifically, I did the following:

cd /sono/peptides # this is where I placed my zipped blast databases
curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-AA.zip ftp://172.17.0.4/
curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-TR.zip ftp://172.17.0.4/

cd /home/tbro
mv queue_config.example.sql queue_config.sql
nano queue_config.sql

-- database files available. name is the name it will be referenced by, md5 is the zip file's sum, download_uri specifies where the file can be retrieved
INSERT INTO database_files (name, md5, download_uri) VALUES
('blastdb-barnacle-AA', '50e7cb5a77f37641a648edc59abcc11a', 'ftp://172.17.0.4/blastdb-barnacle-AA.zip'),
('blastdb-barnacle-TR', '7fc500cce7bb9ac925c39e5d1f986640', 'ftp://172.17.0.4/blastdb-barnacle-TR.zip');

...etc

-- contains information which program is available for which program.
-- additionally, 'availability_filter' can be used to e.g. restrict use for a organism-release combination
INSERT INTO program_database_relationships (programname, database_name, availability_filter) VALUES
('blastn','blastdb-barnacle-TR', 'barnacle-T1'),
('blastp','blastdb-barnacle-AA', 'barnacle-T1'),
('blastx','blastdb-barnacle-AA', 'barnacle-T1'),
('tblastn','blastdb-barnacle-TR', 'barnacle-T1'),
('tblastx','blastdb-barnacle-TR', 'barnacle-T1');

...etc

PGPASSWORD=worker psql -U worker -h 172.17.0.3 -p 5432 <queue_config.sql

I then tried to blast in TBro but no databases were offered as an option.

iimog commented 7 years ago

So sorry, another lack of documentation. Whether a database shows up in TBro depends only on queue_config.sql, specifically on the section program_database_relationships. Here the availability_filter is key (and totally undocumented). This column decides which blast database is shown for which organism and release. The format of this column is {organism_id}_{release}. In the demo data the organism_id is "13" and the release is "1.CasaPuKu", so for the blast db to show up the availability_filter has to be set to 13_1.CasaPuKu. If "barnacle-T1" is your release and 14 is your organism_id (you can check with tbro-db organism list), you have to change the availability filter in queue_config.sql to 14_barnacle-T1. To import this file into the database again, you have to remove all sections except program_database_relationships (otherwise you get errors due to duplicate key values violating unique constraints). I will add the documentation for the availability_filter column both to the example sql file and to the documentation on readthedocs.
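To make the {organism_id}_{release} format concrete, here is a small sketch that builds the filter value and prints a matching INSERT statement. The function names are invented for this example, the table and column names are taken from the queue_config.sql excerpt quoted earlier in the thread, and no SQL escaping is attempted, so it is only suitable for simple names.

```shell
# availability_filter ORGANISM_ID RELEASE
# Join the two parts with an underscore, e.g. 14 + barnacle-T1.
availability_filter() {
  printf '%s_%s\n' "$1" "$2"
}

# relationship_sql ORGANISM_ID RELEASE DATABASE_NAME PROGRAM...
# Hypothetical helper: print one INSERT covering all listed programs.
# Values are inserted verbatim (no quoting or escaping is done).
relationship_sql() {
  filter=$(availability_filter "$1" "$2")
  db=$3
  shift 3
  echo "INSERT INTO program_database_relationships (programname, database_name, availability_filter) VALUES"
  sep=' '
  for prog in "$@"; do
    printf "%s ('%s', '%s', '%s')\n" "$sep" "$prog" "$db" "$filter"
    sep=','
  done
  echo ';'
}
```

For example, relationship_sql 14 barnacle-T1 blastdb-barnacle-TR blastn tblastn tblastx prints a statement whose availability_filter is 14_barnacle-T1 for each of the three programs; the output could be saved to a file and imported with the psql command shown above.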

Thank you very much for your endurance and for reporting all the problems. This helps a lot in improving the documentation.

000generic commented 7 years ago

Great! Now the blast databases are showing up in TBro - Thank-you :)

....but I think I have to correct the uri I am giving TBro, which I had guessed at after curling my zipped Blast database files into Docker.

curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-AA.zip ftp://172.17.0.4/
curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-TR.zip ftp://172.17.0.4/

When I then configure the queue_config.sql file with:

('barnacle-AA4', '50e7cb5a77f37641a648edc59abcc11a', 'ftp://172.17.0.4/blastdb-barnacle-AA.zip'),
('barnacle-TR4', '7fc500cce7bb9ac925c39e5d1f986640', 'ftp://172.17.0.4/blastdb-barnacle-TR.zip');

TBro throws an error:

There has been an error processing your job. Please review your job. If this keeps happening, notify the administrator.

These errors occured: BLAST Database error: No alias or index file found for protein database [/tmp/queue-worker//barnacle-AA4.50e7cb5a77f37641a648edc59abcc11a/barnacle-AA4] in search path [/tmp/queue-worker::]

and when I configure the queue_config.sql file with:

('barnacle-AA5', '50e7cb5a77f37641a648edc59abcc11a', 'http://172.17.0.4/blastdb-barnacle-AA.zip'),
('barnacle-TR5', '7fc500cce7bb9ac925c39e5d1f986640', 'http://172.17.0.4/blastdb-barnacle-TR.zip');

TBro seems to hang up:

Blast Results

Your job is currently being processed. Please wait a moment. This page will refresh in 2 seconds.

The page does an initial refresh saying it is one of one in queue - and then doesn't seem to refresh any more - and remains stalled after many minutes.

iimog commented 7 years ago

OK, now we are really closing in on this. The blastdb is visible in TBro and the download of the zip file seems to work as well. The ftp configuration is the correct one. The problem now is that after unpacking the zip file the blastdb files are not found. How are those named? TBro expects the blastdb files in the zip to be named the same as the name in the database_files table so in your case (this is barnacle-AA3 and barnacle-TR3, right?) TBro will look for files barnacle-AA3.phr, barnacle-AA3.pin, barnacle-AA3.psq in your zip folder. If they are named differently they will not be found. My suggestion: first clean up the old values from database_files and program_database_relationships table by executing this command:

PGPASSWORD=$WORKER_ENV_DB_PW psql -U $WORKER_ENV_DB_USER -h $WORKER_PORT_5432_TCP_ADDR -p $WORKER_PORT_5432_TCP_PORT -d $WORKER_ENV_DB_NAME -c 'TRUNCATE database_files CASCADE'

use with care as it will also remove all past and present blast jobs from the database.

Then re-import an sql file with the two sections for database_files and program_database_relationships with fixed name column where it corresponds to the name of the blastdb (without .p* or .n* ending). If you verify that it works I will update docu here as well.
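The naming rule can be checked before uploading. The sketch below lists the file names TBro would look for, given the name column, and compares them with the contents of a zip. It is an approximation under several assumptions: the helper names are made up, only the three classic files per database type are checked (.phr/.pin/.psq for protein, .nhr/.nin/.nsq for nucleotide), multi-volume databases and names containing spaces are not handled, and the files are assumed to sit at the top level of the zip.

```shell
# expected_blastdb_files NAME prot|nucl
# Print the file names expected for a database called NAME.
expected_blastdb_files() {
  case "$2" in
    prot) exts="phr pin psq" ;;
    nucl) exts="nhr nin nsq" ;;
    *) echo "type must be prot or nucl" >&2; return 1 ;;
  esac
  for ext in $exts; do
    printf '%s.%s\n' "$1" "$ext"
  done
}

# check_blastdb_zip ZIPFILE NAME prot|nucl
# Hypothetical check: report expected files that are missing from the zip.
check_blastdb_zip() {
  listing=$(unzip -Z1 "$1") || return 1
  files=$(expected_blastdb_files "$2" "$3") || return 1
  status=0
  for f in $files; do
    if ! printf '%s\n' "$listing" | grep -Fqx -- "$f"; then
      echo "missing in $1: $f" >&2
      status=1
    fi
  done
  return $status
}
```

With the values from this thread, check_blastdb_zip blastdb-barnacle-AA.zip barnacle-AA4 prot would be expected to report missing files, because the zip contains files named after blastdb-barnacle-AA rather than barnacle-AA4.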

000generic commented 7 years ago

Genius! That works great - now I am blasting against my blast databases :)

....however, while the blast hits show visual alignments with many good hits, they are not showing any isoform information (instead, under the Name column in the blast report, the hits all say 'No') and the link in 'No' just goes to the TBro landing page.

[Screenshot, 2017-05-04 6:14 am]

This is true even when blasting, for instance, a protein that was used in building the blast database. When I search the same protein in TBro based on its id, the protein is returned as an isoform with a link that takes me to its TBro webpage. I checked, and the identifiers used in 1) the imported fasta files, 2) imported identifiers, 3) imported .tbl files, and 4) the fastas used to build the blast databases are all the same. For instance:

barnacle-ee100-aa

[Screenshot, 2017-05-04 6:45 am]

So, it seems like the blast job is successful but the hits generated are not linking back to the TBro databases. I'll try rebuilding TBro again from the ground up but most likely I need to modify something somewhere along the way.

Once we have all this worked out, I can generalize the steps and provide them to you, or post them on GitHub etc. I think the combination of free/cheap easy up/easy down Amazon cloud + TBro is really great. Rather than a long-term repository, it's often useful to make things available to collaborators (or myself) with many updates for just a few days to months, and I think the Amazon/Docker/TBro combo is going to be a great way to do this. There is already growing interest from others here at the Marine Biological Laboratory.

iimog commented 7 years ago

Nice! Happy to hear that it is finally working. The issue with showing "No" in the Name column is indeed very strange. TBro tries to map the name of the blast hit to an internal ID but even if it fails it should still show the original ID of the hit. This ID is parsed out of the Blast result xml. Something seems to go wrong there. Would you mind sharing the xml result? You can get this by calling the webservice directly via: http://<your-tbro-machine>/ajax/queue/job_results?jobid=<your-jobid> replacing both your-tbro-machine and your-jobid with the respective values. The jobid is the one you get when starting a blast job. If you do not want to share this file you can have a look yourself. In the <Hit_def> tag the first word is assumed to be the ID. For an example blast job on the public instance a Hit_def line might look like this:

<Hit_def>cds.comp234028_c1.1_seq4|m.808277 comp234028_c1.1_seq4|g.808277  ORF comp234028_c1.1_seq4|g.808277 comp234028_c1.1_seq4|m.808277 type:complete len:725 (+) comp234028_c1.1_seq4:254-2428(+)</Hit_def>

How does a Hit_def line look in your blast result xml?
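For a quick look at what the "first word of Hit_def" rule yields, a sed one-liner is enough. This is not TBro's own parser, only an approximation for inspecting a result by hand; the function name hit_def_first_word is made up.

```shell
# hit_def_first_word
# Read BLAST XML on stdin and print, for every line containing a
# <Hit_def> element, the text up to the first space or closing tag.
hit_def_first_word() {
  sed -n 's/.*<Hit_def>\([^ <]*\).*/\1/p'
}
```

Feeding it the example line above prints cds.comp234028_c1.1_seq4|m.808277, while a line such as <Hit_def>No definition line</Hit_def> prints No. Assuming jq is available, the XML could be pulled out of the webservice response with something like curl -s "http://<your-tbro-machine>/ajax/queue/job_results?jobid=<your-jobid>" | jq -r '.processed_results[].result' | hit_def_first_word.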

I hope to sort out this last problem as well. A step by step guide for TBro on AWS would be really cool. If you don't mind I'd suggest including it as a separate section in the official documentation. Your contribution in improving and disseminating TBro is very much appreciated.

000generic commented 7 years ago

Sure - here is the xml file. It looks like 'No' is short for 'No definition line' - I'll try rebuilding things and see if I can get lucky and solve anything.

{ "job_status": "PROCESSED", "additional_data": { "organism": "16", "release": "barnacle-T1" }, "processed_results": [ { "query": ">barnacle-ee100\nTTAGGAGCAAATGAAAAGAAGAAAGCTGGAAAAAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC", "status": "PROCESSED", "result": "<?xml version=\"1.0\"?>\n<!DOCTYPE BlastOutput PUBLIC \"-\/\/NCBI\/\/NCBI BlastOutput\/EN\" \"http:\/\/www.ncbi.nlm.nih.gov\/dtd\/NCBI_BlastOutput.dtd\">\n\n blastn<\/BlastOutput_program>\n BLASTN 2.2.28+<\/BlastOutput_version>\n Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 
25:3389-3402.<\/BlastOutput_reference>\n \/tmp\/queue-worker\/\/blastdb-barnacle-TR.7fc500cce7bb9ac925c39e5d1f986640\/blastdb-barnacle-TR<\/BlastOutput_db>\n Query_1<\/BlastOutput_query-ID>\n barnacle-ee100<\/BlastOutput_query-def>\n 467<\/BlastOutput_query-len>\n \n \n 0.1<\/Parameters_expect>\n 2<\/Parameters_sc-match>\n -3<\/Parameters_sc-mismatch>\n 5<\/Parameters_gap-open>\n 2<\/Parameters_gap-extend>\n L;m;<\/Parameters_filter>\n <\/Parameters>\n <\/BlastOutput_param>\n\n\n 1<\/Iteration_iter-num>\n Query_1<\/Iteration_query-ID>\n barnacle-ee100<\/Iteration_query-def>\n 467<\/Iteration_query-len>\n\n\n 1<\/Hit_num>\n barnacle-ee100<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee100<\/Hit_accession>\n 467<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 843.46<\/Hsp_bit-score>\n 934<\/Hsp_score>\n 0<\/Hsp_evalue>\n 1<\/Hsp_query-from>\n 467<\/Hsp_query-to>\n 1<\/Hsp_hit-from>\n 467<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n 1<\/Hsp_hit-frame>\n 467<\/Hsp_identity>\n 467<\/Hsp_positive>\n 0<\/Hsp_gaps>\n 467<\/Hsp_align-len>\n TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAAAAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC<\/Hsp_qseq>\n TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAAAAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC<\/Hsp_hseq>\n 
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 2<\/Hit_num>\n barnacle-ee232648<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee232648<\/Hit_accession>\n 493<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 547.707<\/Hsp_bit-score>\n 606<\/Hsp_score>\n 7.56147e-155<\/Hsp_evalue>\n 68<\/Hsp_query-from>\n 467<\/Hsp_query-to>\n 3<\/Hsp_hit-from>\n 406<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n 1<\/Hsp_hit-frame>\n 364<\/Hsp_identity>\n 364<\/Hsp_positive>\n 4<\/Hsp_gaps>\n 404<\/Hsp_align-len>\n ACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGG----TGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC<\/Hsp_qseq>\n ACACAAAATCCGGACAAAACCGCACGATGGACATGTTTTGGTCGTTCACAAAGCAGAACCTCTCCTGAGGAAAACACTCCGGAACCTAGCGAAGGTCTTGGCATGGGAAATGACGAGCGTTTGGGATTTGCAAAATTTGCACAATGTGTCTAAGAAACAGATGACCGACACATTTTGCTACGATATGCTGCGAAAAAATTGCCGCTGGCACCTCCCATTTGTTCAAGAATGGGGTTGTTTTGAAAAGGAAAATGGTACCTACTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTGGTGCCAGAGAAGACCTACGAAATACTCACGGATGGTGTCGATGCAAGCTATCGACTTTCAGGGAGAAACGGATGTTGCCAGGAAGCTCGTCAAACCTAACTCCGAC<\/Hsp_hseq>\n ||||||||||| || |||||| ||||||||||| |||| || |||| |||||||||||| ||||| ||||||||||||||| |||| |||||||||||||||||||| ||||||||||| |||| |||||||||||||||||||||||||| ||||||||| | |||||||||| |||||| ||||| ||||||||| || 
| ||||||| ||||||||||||| ||||||||||||||||||||||||||||| | |||||||||||||||||||||||||||||||||||| | |||||||||||||||| |||||||||||||||||||||||||||||| ||| ||||||||||||| ||||||||||| || |||||||||||||||| ||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 3<\/Hit_num>\n barnacle-ee310756<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee310756<\/Hit_accession>\n 475<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 462.949<\/Hsp_bit-score>\n 512<\/Hsp_score>\n 2.47404e-129<\/Hsp_evalue>\n 33<\/Hsp_query-from>\n 353<\/Hsp_query-to>\n 155<\/Hsp_hit-from>\n 475<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n 1<\/Hsp_hit-frame>\n 295<\/Hsp_identity>\n 295<\/Hsp_positive>\n 0<\/Hsp_gaps>\n 321<\/Hsp_align-len>\n AAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGA<\/Hsp_qseq>\n AAGAGGCAGATCTGGAGCAAATAGCTTTCTTTTACACACAAAATCCGGACAAAACCGCACGATGGACAGGTTTTGGCCGTTTACAAAGCAGAACCTCTCCTGAGGAAAACACTCCGGAACCCAGCGAAGGTCTTGGCATTGGGAATGACGAGCGTTCGGGATTTGCAAATTTTGCAAAATGTGTCTAAGAAACAGATGGCCGACACATTCTGCTACGATATGCTGCGAAAAATTTGCCCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGA<\/Hsp_hseq>\n |||||||||||||| ||| |||| ||||||||| ||||||||||| || |||||| |||||||||||||||| |||||||||||||||||||| ||||| ||||||||||||||| |||| ||||||||||||||||| |||||||||||||| | || ||||||||| |||||| ||||||||| ||||||||| | ||||||||||||||||| ||||| |||||||||||| | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 4<\/Hit_num>\n barnacle-ee7959<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee7959<\/Hit_accession>\n 612<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 354.747<\/Hsp_bit-score>\n 392<\/Hsp_score>\n 9.23619e-97<\/Hsp_evalue>\n 33<\/Hsp_query-from>\n 
288<\/Hsp_query-to>\n 358<\/Hsp_hit-from>\n 612<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n 1<\/Hsp_hit-frame>\n 233<\/Hsp_identity>\n 233<\/Hsp_positive>\n 1<\/Hsp_gaps>\n 256<\/Hsp_align-len>\n AAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTG<\/Hsp_qseq>\n AAGAGGCAGATCTGGAGCAAATAGCTTTCTTTTACACACAAAATCCCGACAAAACCGCACGATGGACAGGTTTTGGCCGTTTACAAAGCAGAACCTCTCCTGAGGAAAACACTCCGCAACCCAGCGAAGGTCTCGGCATGGGGAATGAAGAGCGTTTGGGATTTGCAAAATTTGCA-AATGTGTCTAAGAAACAGATGGCCGACACATTCTGCTACGATATGCTGCGAAAAATTTGCCCCTGGCACTTCCCATTTG<\/Hsp_hseq>\n |||||||||||||| ||| |||| ||||||||| |||||||||||||| |||||| |||||||||||||||| |||||||||||||||||||| ||||| |||||||||||||||||||| ||||||||||| |||||||||||||| ||||| |||| |||||||||||||||| ||||||||| ||||||||| | ||||||||||||||||| ||||| |||||||||||| | ||||||||||||||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 5<\/Hit_num>\n barnacle-ee288238<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee288238<\/Hit_accession>\n 620<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 336.713<\/Hsp_bit-score>\n 372<\/Hsp_score>\n 2.47841e-91<\/Hsp_evalue>\n 220<\/Hsp_query-from>\n 467<\/Hsp_query-to>\n 619<\/Hsp_hit-from>\n 373<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n -1<\/Hsp_hit-frame>\n 224<\/Hsp_identity>\n 224<\/Hsp_positive>\n 1<\/Hsp_gaps>\n 248<\/Hsp_align-len>\n AGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC<\/Hsp_qseq>\n 
AGAAACAGATGGCCGACACATTCTGCTACGATATTATGCAAACAAATTACTCCTGGCAATTCCCGTTTGTTCAGGAATGGGGTCGTTTTGAAAAGGGAAATGGTGCTTTGGATGTCGGCGCCAGCC-TTCGTGAAGGAGTTGGCGCCAGAGAATACCTACAAAATGCTCATGAATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCAAATGTGGCCGGGGAGCTCGTCAAACCTAAATCCGAC<\/Hsp_hseq>\n ||||||||| | ||||||||||||||||| |||| |||| || || |||||||||||| ||||| |||||||||||||||||| |||||||||||| ||||||||||| ||| ||||||||||| | ||| |||||||||| ||||||||||| ||||||||||| |||| | |||||||||||||||||||||||||||||||||||||| |||| ||||||||||||||||||||||||||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 6<\/Hit_num>\n barnacle-ee34877<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee34877<\/Hit_accession>\n 457<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 233.921<\/Hsp_bit-score>\n 258<\/Hsp_score>\n 2.17598e-60<\/Hsp_evalue>\n 33<\/Hsp_query-from>\n 277<\/Hsp_query-to>\n 249<\/Hsp_hit-from>\n 3<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n -1<\/Hsp_hit-frame>\n 203<\/Hsp_identity>\n 203<\/Hsp_positive>\n 8<\/Hsp_gaps>\n 250<\/Hsp_align-len>\n AAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTT---TGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGC--GAAAAATTTACTCCTGGCA<\/Hsp_qseq>\n AAGAGGCAAATCTGGAGCGAATAGCTTTCCTTCGGACACAAAATCCC---AACACCACATAATAGACTGGCTGCTTAGGCCGTTTACAAAGCAAAAGCTCTCTTGAGGAAAACACTCCGCAACCCAGCGAATGTTTTGGCATGGGGAATGTTGAGCGTTTGGAATTTGCAAAATTTGAACGGCGTGTCTCAGAAACAGATGGCCAACACATTCTGCTACGATATTATGCAAAAAAAAATTACCCCTGGCA<\/Hsp_hseq>\n |||||||| ||||| |||||||| |||| || |||||||||||| || |||||| || ||| || | | |||||||||||||||| || |||| |||||||||||||||||||| |||||| || ||||||||||||||| ||||| ||||||||||||||||||| || |||||| ||||||||| | || |||||||||||||| |||| |||| ||||| |||| |||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 7<\/Hit_num>\n barnacle-ee294988<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee294988<\/Hit_accession>\n 2129<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 53.584<\/Hsp_bit-score>\n 
58<\/Hsp_score>\n 4.21178e-06<\/Hsp_evalue>\n 1<\/Hsp_query-from>\n 32<\/Hsp_query-to>\n 1333<\/Hsp_hit-from>\n 1302<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n -1<\/Hsp_hit-frame>\n 31<\/Hsp_identity>\n 31<\/Hsp_positive>\n 0<\/Hsp_gaps>\n 32<\/Hsp_align-len>\n TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAA<\/Hsp_qseq>\n TTAGGAGCCAATGAAAAGAAGAAAGCTGGAAA<\/Hsp_hseq>\n |||||||| |||||||||||||||||||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 8<\/Hit_num>\n barnacle-ee265222<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee265222<\/Hit_accession>\n 484<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 53.584<\/Hsp_bit-score>\n 58<\/Hsp_score>\n 4.21178e-06<\/Hsp_evalue>\n 1<\/Hsp_query-from>\n 32<\/Hsp_query-to>\n 318<\/Hsp_hit-from>\n 349<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n 1<\/Hsp_hit-frame>\n 31<\/Hsp_identity>\n 31<\/Hsp_positive>\n 0<\/Hsp_gaps>\n 32<\/Hsp_align-len>\n TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAA<\/Hsp_qseq>\n TTAGGAGCCAATGAAAAGAAGAAAGCTGGAAA<\/Hsp_hseq>\n |||||||| |||||||||||||||||||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n\n 9<\/Hit_num>\n barnacle-ee316353<\/Hit_id>\n No definition line<\/Hit_def>\n barnacle-ee316353<\/Hit_accession>\n 626<\/Hit_len>\n \n \n 1<\/Hsp_num>\n 40.9604<\/Hsp_bit-score>\n 44<\/Hsp_score>\n 0.0265793<\/Hsp_evalue>\n 84<\/Hsp_query-from>\n 125<\/Hsp_query-to>\n 506<\/Hsp_hit-from>\n 547<\/Hsp_hit-to>\n 1<\/Hsp_query-frame>\n 1<\/Hsp_hit-frame>\n 34<\/Hsp_identity>\n 34<\/Hsp_positive>\n 0<\/Hsp_gaps>\n 42<\/Hsp_align-len>\n AAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAA<\/Hsp_qseq>\n AAACGACATGATGACCAGGCTTGAGAAGTTTACAAAGCAGAA<\/Hsp_hseq>\n |||| ||| |||| |||| ||| | |||||||||||||||<\/Hsp_midline>\n <\/Hsp>\n <\/Hit_hsps>\n<\/Hit>\n<\/Iteration_hits>\n \n \n 192231<\/Statistics_db-num>\n 134919102<\/Statistics_db-len>\n 28<\/Statistics_hsp-len>\n 56866582326<\/Statistics_eff-space>\n 0.41<\/Statistics_kappa>\n 0.625<\/Statistics_lambda>\n 0.78<\/Statistics_entropy>\n <\/Statistics>\n 
<\/Iteration_stat>\n<\/Iteration>\n<\/BlastOutput_iterations>\n<\/BlastOutput>\n\n", "errors": "" } ] }

iimog commented 7 years ago

Thanks for sharing. I think I found the problem. When creating a blast database from fasta via makeblastdb, there is an option called -parse_seqids, which is not set by default. Without it, the ids of entries in the blastdb are generated automatically and the whole fasta header (id + desc, everything after >) is stored in the def of the entry. This is why TBro parses the first word from the <Hit_def>. However, if the -parse_seqids option is used, the fasta id (first word after >) is used as id and only the rest of the line (in your case, nothing) is stored in def.

So when you are rebuilding, could you please re-generate the blast databases without the -parse_seqids flag?

I think in general it is more appropriate to have blast databases that use -parse_seqids and hence have the id in <Hit_id>. But as this will break backwards compatibility I will schedule this change for version 1.2.0. I will open a separate issue for that.
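Rebuilding a database for the current TBro behaviour might look like the sketch below. It assumes BLAST+ (makeblastdb) and zip are installed; the wrapper name build_tbro_blastdb is hypothetical, and NAME should match the name column in database_files, as discussed earlier in the thread.

```shell
# build_tbro_blastdb FASTA nucl|prot NAME
# Hypothetical wrapper: build a BLAST database called NAME from FASTA
# without -parse_seqids, then zip the resulting database files.
build_tbro_blastdb() {
  fasta=$1
  dbtype=$2
  name=$3
  case "$dbtype" in
    nucl) prefix=n ;;
    prot) prefix=p ;;
    *) echo "dbtype must be nucl or prot" >&2; return 1 ;;
  esac
  # -parse_seqids is left out on purpose, so the full fasta header
  # ends up in the definition line of each entry.
  makeblastdb -in "$fasta" -dbtype "$dbtype" -out "$name" || return 1
  # Database files have three-letter extensions starting with n or p.
  zip "$name.zip" "$name".$prefix??
}
```

For example, build_tbro_blastdb barnacle-TR.fasta nucl barnacle-TR4 (the fasta file name is invented here) would leave barnacle-TR4.zip in the current directory. The md5 column in database_files would then be the checksum of that zip, for instance as printed by md5sum.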

000generic commented 7 years ago

Great detective work! I haven't tested it yet but I think that makes sense.

I agree, it's generally more appropriate/useful to have a blast database that is set up with -parse_seqids. For instance, I believe it enables blastdbcmd to pull sequences out of the database using a single familiar identifier that is used throughout a workflow, so it makes working with a blast database at the command line much easier. But it's no problem to make a separate database just for TBro. And it would be great to add a flag to the TBro setup in v1.2.0 that indicates whether a blast database was built with or without -parse_seqids!

I'll test a new blast database set up in TBro next...

It works fantastic!!!

Now on to Expression Search :)