bcgsc / mavis

Merging, Annotation, Validation, and Illustration of Structural variants
http://mavis.bcgsc.ca
GNU General Public License v3.0

Issue generating the annotations from Ensembl #205

Closed moldach closed 4 years ago

moldach commented 4 years ago

MAVIS version: 2.2.6

Python version: 3.8.0

OS: CentOS Linux release 7.5.1804 (Core)

I have downloaded the helper script from MAVIS and followed the installation steps on the Ensembl site to verify the connection.

Test connection to API

./test-perlApi-2.sh 
Installation is good. Connection to Ensembl works and you can query the human core database

Run the perl script help menu

perl ~/bin/tools/generate_ensembl_json.pl --output /scratch/moldach/test.json

error: required argument --ensembl_host not provided at /home/moldach/bin/tools/generate_ensembl_json.pl line 116.

I don't see any reference to --ensembl_host in the MAVIS docs or in the Ensembl API docs, so I'm wondering what I need to put here.

Furthermore, I want to get the Ensembl annotations for C. elegans rather than human, so which parameter do I set in generate_ensembl_json.pl to specify the model organism of interest?

Thank you

creisle commented 4 years ago

Hi @moldach !

The ensembl_host is the host server that is serving the ensembl db/API. We keep a copy locally. It took some searching but I was able to find it in their docs here (it's not easy to find anymore it seems). The Perl script isn't required: as long as you format the final result to match the JSON expectations, you can use another tool or script if you prefer (Perl can be difficult to install).

I haven't tried with non-human genomes, but from the Ensembl website instructions it looks like you may need additional Perl modules. You will also need to edit the script itself to look for non-human data. I've linked the relevant line here: https://github.com/bcgsc/mavis/blob/develop/tools/generate_ensembl_json.pl#L136. Assuming the non-human genes are stored in a similar format in their database, it should work from there.

I am pinging @mattdoug604 and @calchoo to comment as well since they have generated non-human genome files for rn6 and mm10 respectively.

moldach commented 4 years ago

Thanks for the follow-up, @creisle. Can anyone confirm whether the Ensembl Perl API is working now?

I tried this yesterday, but the ping_ensembl.pl script no longer prints "Installation is good. Connection to Ensembl works and you can query the human core database", so I could not troubleshoot this.

moldach commented 4 years ago

Okay, the API is back up again and I get the "Installation is good..." message.

After editing line 136 of the script (vim +136 generate_ensembl_json.pl), it seems like it should work.

It looks like I have all of the Perl modules, according to the ensembl website:

#!/bin/bash
#SBATCH --job-name=perlAPI
#SBATCH --time=0:30:0
#SBATCH --mem=500
#SBATCH --output=perlAPI.log
#SBATCH --output=perlAPI.err

PERL5LIB=$HOME/bin/src/BioPerl-1.6.924:$PERL5LIB
PERL5LIB=$HOME/bin/src/ensembl/modules:$PERL5LIB
PERL5LIB=$HOME/bin/src/ensembl-variation/modules:$PERL5LIB
PERL5LIB=$HOME/bin/src/ensembl-compara/modules:$PERL5LIB
PERL5LIB=$HOME/bin/src/ensembl-funcgen/modules:$PERL5LIB
PERL5LIB=$HOME/bin/src/ensembl-tools/modules:$PERL5LIB
PERL5LIB=${PERL5LIB}:/home/moldach/bin/tools
export PERL5LIB

perl ~/bin/tools/generate_ensembl_json.pl --output /scratch/moldach/celegan.json
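One way to confirm that the PERL5LIB entries above actually resolve, before submitting the job, is to ask perl to load a module from them. The following is a minimal sketch: the function names perl_module_available and check_perl_modules are hypothetical helpers, not part of MAVIS or Ensembl, and the two module names in the usage comment are simply the ones that appear in the error output in this thread (DBI and Bio::EnsEMBL::Registry).

```shell
# Return success if perl can load the named module with the current PERL5LIB.
# (Hypothetical helper; relies only on perl's standard -M and -e switches.)
perl_module_available() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Report each module as ok or MISSING; return non-zero if any is missing.
check_perl_modules() {
    missing=0
    for m in "$@"; do
        if perl_module_available "$m"; then
            echo "ok       $m"
        else
            echo "MISSING  $m"
            missing=1
        fi
    done
    return "$missing"
}

# Example, run after the PERL5LIB exports above:
#   check_perl_modules DBI Bio::EnsEMBL::Registry
```

This only checks that the modules load; it does not test the database connection used by the generate_ensembl_json.pl run above.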

Now I get the error:

error: required argument --ensembl_host not provided at /home/moldach/bin/tools/generate_ensembl_json.pl line 116.

Okay so that's progress. So you mentioned:

The ensembl_host is the host server that is serving the ensembl db/API. We keep a copy locally. It took some searching but I was able to find it in their docs here (it's not easy to find anymore it seems).

I'm confused as to what you mean here about ensembl_host and a local copy. What do I need to fill in instead (or what do I need to download for a local copy)?

Looking inside the generate_ensembl_json.pl file, I searched for "host":

    my $ensembl_host = defined $ENV{'ENSEMBL_HOST'} ? $ENV{'ENSEMBL_HOST'} : '';
    my $ensembl_port = defined $ENV{'ENSEMBL_PORT'} ? $ENV{'ENSEMBL_PORT'} : 3306;
    my $ensembl_user = defined $ENV{'ENSEMBL_USER'} ? $ENV{'ENSEMBL_USER'} : 'ensembl';
    my $ensembl_pass = defined $ENV{'ENSEMBL_PASS'} ? $ENV{'ENSEMBL_PASS'} : 

I tried replacing the empty host string with mysql-eg-publicsql.ebi.ac.uk, but now I get this error:

connecting to mysql-eg-publicsql.ebi.ac.uk:3306 as ensembl
DBI connect('host=mysql-eg-publicsql.ebi.ac.uk;port=3306','ensembl',...) failed: Access denied for user 'ensembl'@'lcg-ce2.sfu.computecanada.ca' (using password: YES) at /hom$

-------------------- EXCEPTION --------------------
MSG: Cannot connect to the Ensembl MySQL server at mysql-eg-publicsql.ebi.ac.uk:3306; check your settings & DBI error message: Access denied for user 'ensembl'@'lcg-ce2.sfu.c$
STACK Bio::EnsEMBL::Registry::load_registry_from_db /home/moldach/bin/src/ensembl/modules/Bio/EnsEMBL/Registry.pm:1770
STACK main::main /home/moldach/bin/tools/generate_ensembl_json.pl:120
STACK toplevel /home/moldach/bin/tools/generate_ensembl_json.pl:56
Date (localtime)    = Tue Apr 21 14:31:24 2020
Ensembl API version = 99
---------------------------------------------------

moldach commented 4 years ago

I think that HOST/PORT/USER should be either:

    my $ensembl_host = defined $ENV{'ENSEMBL_HOST'} ? $ENV{'ENSEMBL_HOST'} : 'mysql-eg-publicsql.ebi.ac.uk';
    my $ensembl_port = defined $ENV{'ENSEMBL_PORT'} ? $ENV{'ENSEMBL_PORT'} : 4157;
    my $ensembl_user = defined $ENV{'ENSEMBL_USER'} ? $ENV{'ENSEMBL_USER'} : 'anonymous';
    my $ensembl_pass = defined $ENV{'ENSEMBL_PASS'} ? $ENV{'ENSEMBL_PASS'} : '';

or

    my $ensembl_host = defined $ENV{'ENSEMBL_HOST'} ? $ENV{'ENSEMBL_HOST'} : 'ensembldb.ensembl.org';
    my $ensembl_port = defined $ENV{'ENSEMBL_PORT'} ? $ENV{'ENSEMBL_PORT'} : 5306;
    my $ensembl_user = defined $ENV{'ENSEMBL_USER'} ? $ENV{'ENSEMBL_USER'} : 'anonymous';
    my $ensembl_pass = defined $ENV{'ENSEMBL_PASS'} ? $ENV{'ENSEMBL_PASS'} : '';

This gets rid of the error message. The script has been running for more than 12 hours, and this is the output I see so far (still running, but unsuccessfully?):

connecting to mysql-eg-publicsql.ebi.ac.uk:4157 as anonymous
loading 46904 genes
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

I would really appreciate your help in sorting this out @creisle @calchoo @mattdoug604 . Thanks
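Since the script excerpt quoted earlier reads ENSEMBL_HOST, ENSEMBL_PORT, ENSEMBL_USER and ENSEMBL_PASS from the environment before falling back to its defaults, the same settings can probably be supplied without editing the script. Here is a minimal sketch under that assumption. The helper name use_public_ensembl is hypothetical, and the host, port, and user values are the ones tried in this thread.

```shell
# Export the connection settings that generate_ensembl_json.pl looks up in
# its environment. use_public_ensembl is a hypothetical helper name.
use_public_ensembl() {
    case "$1" in
        genomes)  # Ensembl Genomes public server (values from this thread)
            export ENSEMBL_HOST='mysql-eg-publicsql.ebi.ac.uk' ENSEMBL_PORT=4157 ;;
        main)     # main Ensembl public server (values from this thread)
            export ENSEMBL_HOST='ensembldb.ensembl.org' ENSEMBL_PORT=5306 ;;
        *)
            echo "usage: use_public_ensembl genomes|main" >&2
            return 1 ;;
    esac
    # The public servers were reached here as 'anonymous' with an empty password.
    export ENSEMBL_USER='anonymous' ENSEMBL_PASS=''
}

# Example (not run here: needs network access and the Ensembl Perl API):
#   use_public_ensembl genomes
#   perl ~/bin/tools/generate_ensembl_json.pl --output /scratch/moldach/celegan.json
```

Keeping the settings in the environment leaves the script identical to the upstream copy, which makes it easier to switch servers or to compare against a later version. The error message also implies an --ensembl_host command-line option, which would be another way to pass the host.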

creisle commented 4 years ago

@moldach how long it runs is probably species- and connection-dependent. It took a couple of hours when we did it for human, but that was with a local instance of the DB, so there was no other traffic and we had maximum connection speed. @mattdoug604 how long did it take when you did this for non-human genes?

@moldach the '.' are output 1 per gene, if they are still being written then likely it is still running

moldach commented 4 years ago

@moldach the '.' are output 1 per gene, if they are still being written then likely it is still running

This was very helpful! I used grep -o -i . myLogFile.err on that log to search for "." and found that 19,115 genes had downloaded after 24 hours 😞

Silver lining is that it looks like the process is working. I re-ran the script and although the top of the log says loading 46904 genes the count of "."'s found was 47077. Not sure for the discrepancy there; however, I now have a .JSON file.

In case this is useful for anyone else, the job took 36 hours. The model organism is C. elegans and the connection is from an academic HPC, so it's strange that it took so long.
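A note on counting progress from the log: in a grep pattern an unescaped . matches any character, so grep -o . also counts characters from other log lines (hostnames, the "loading ... genes" line, and so on). That may account for part of the discrepancy mentioned above, though this is only a guess. Here is a minimal sketch that counts only the characters on lines made up entirely of dots, and reads the expected total from the "loading N genes" line. The function names count_gene_dots and expected_genes are hypothetical, and the log layout is assumed to match the output shown earlier in this thread.

```shell
# Count '.' characters, but only on lines that consist solely of dots,
# so dots inside hostnames or file paths are ignored.
count_gene_dots() {
    awk '/^\.+$/ { n += length($0) } END { print n + 0 }' "$1"
}

# Print N from the first "loading N genes" line in the log.
expected_genes() {
    awk '$1 == "loading" && $3 == "genes" { print $2; exit }' "$1"
}

# Example:
#   log=myLogFile.err
#   echo "$(count_gene_dots "$log") of $(expected_genes "$log") genes written"
```

If the progress dots are ever interleaved with other text on the same line, this undercounts, so treat the result as an estimate.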

creisle commented 4 years ago

Glad you were able to get it output!

That does seem particularly long, but the retrieval rate is likely affected by the load on the remote Ensembl server. Since this is the main public Ensembl database server, it probably has very high usage. The version we have locally was a dump that we host internally, so we were one of only a couple of users running scripts against it at the time.

I re-ran the script and although the top of the log says loading 46904 genes the count of "."'s found was 47077. Not sure for the discrepancy there; however, I now have a .JSON file.

I'll take a look at the script and see if it suggests anything that would be the cause of this discrepancy

creisle commented 4 years ago

I couldn't find anything that would suggest why the number of '.' characters might differ. Since the script writes them to stdout, it is possible that it just didn't flush and a '.' got duplicated.

Are there any other outstanding issues here? Is it OK to close this?

moldach commented 4 years ago

No issue here I guess, thanks