choderalab / ensembler

Automated omics-scale protein modeling and simulation setup.
http://ensembler.readthedocs.io/
GNU General Public License v2.0

Issue while downloading sifts files #62

Closed msultan closed 8 years ago

msultan commented 8 years ago

I was trying to use the quickmodel command, and it kept failing while trying to download the SIFTS files.

I am using the latest conda build, and here is the command that I ran:

ensembler quickmodel --target_uniprot_entry_name ABL1_HUMAN --uniprot_domain_regex '^Protein kinase' --template_pdbids 2SRC

Here is the error:

Querying UniProt web server...
Number of entries returned from initial UniProt search: 2

Set of unique domain names returned from the initial UniProt search using the query string 'mnemonic:ABL1_HUMAN':
set(['SH2', 'SH3', 'Protein kinase'])

Unique domain names selected after searching with the case-sensitive regex string '^Protein kinase':
set(['Protein kinase'])

Done.
Downloading PDB file for: 2SRC
Downloading sifts file for: 2SRC
ERROR downloading SIFTS file with PDB ID: 2SRC
Traceback (most recent call last):
  File "/home/msultan/software/anaconda/bin/ensembler", line 6, in <module>
    sys.exit(main())
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/cli.py", line 40, in main
    command.dispatch(args)
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/cli_commands/quickmodel.py", line 106, in dispatch
    QuickModel(targetid=args['--targetid'], templateids=templateids, target_uniprot_entry_name=args['--target_uniprot_entry_name'], uniprot_domain_regex=args['--uniprot_domain_regex'], pdbids=pdbids, chainids=chainids_dict, template_uniprot_query=args['--template_uniprot_query'], template_seqid_cutoff=template_seqid_cutoff, loopmodel=not args['--no-loopmodel'], package_for_fah=args['--package_for_fah'], nfahclones=nfahclones, structure_dirs=structure_paths)
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/tools/quick_model.py", line 78, in __init__
    ensembler.initproject.gather_templates_from_pdb(self.pdbids, self.uniprot_domain_regex, chainids=self.chainids, structure_dirs=self.structure_dirs)
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/utils.py", line 37, in print_done
    fn(*args, **kwargs)
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/initproject.py", line 373, in gather_templates_from_pdb
    get_structure_files_for_single_pdbchain(pdbid, structure_dirs)
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/initproject.py", line 493, in get_structure_files_for_single_pdbchain
    pdbid, project_structure_filepath, structure_type=structure_type
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/initproject.py", line 451, in download_structure_file
    download_sifts_file(pdbid, project_structure_filepath)
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/initproject.py", line 463, in download_sifts_file
    sifts_page = ensembler.pdb.retrieve_sifts(pdbid)
  File "/home/msultan/software/anaconda/lib/python2.7/site-packages/ensembler/pdb.py", line 68, in retrieve_sifts
    response = urlopen(url)
  File "/home/msultan/software/anaconda/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/home/msultan/software/anaconda/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/home/msultan/software/anaconda/lib/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/home/msultan/software/anaconda/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/home/msultan/software/anaconda/lib/python2.7/urllib2.py", line 1412, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/home/msultan/software/anaconda/lib/python2.7/urllib2.py", line 1434, in connect_ftp
    persistent=False)
  File "/home/msultan/software/anaconda/lib/python2.7/urllib.py", line 875, in __init__
    self.init()
  File "/home/msultan/software/anaconda/lib/python2.7/urllib.py", line 887, in init
    self.ftp.cwd(_target)
  File "/home/msultan/software/anaconda/lib/python2.7/ftplib.py", line 562, in cwd
    return self.voidcmd(cmd)
  File "/home/msultan/software/anaconda/lib/python2.7/ftplib.py", line 254, in voidcmd
    return self.voidresp()
  File "/home/msultan/software/anaconda/lib/python2.7/ftplib.py", line 229, in voidresp
    resp = self.getresp()
  File "/home/msultan/software/anaconda/lib/python2.7/ftplib.py", line 224, in getresp
    raise error_perm, resp
urllib2.URLError: <urlopen error ftp error: 550 Failed to change directory.>

I have reproduced the error on two different clusters and with a variety of template IDs. As of this post, the SIFTS page for 2SRC seems to be up as well.

jchodera commented 8 years ago

This could be due to another change in the SIFTS interface.

@danielparton: Any chance you have a moment to take a quick look at this?

If not, I can take a stab at trying to figure out what changed this weekend.

jchodera commented 8 years ago

OK, that's weird. I see that I can get at the URL ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/ just fine, and the constructed URL including 2SRC should be: ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/2src.xml.gz
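
The URL construction described above can be sketched as follows. This is a minimal illustration, not ensembler's actual code: the helper name `sifts_url` and the 4-character ID check are assumptions, and only the base URL and the lowercase `<pdbid>.xml.gz` pattern are taken from this thread.

```python
# Base directory quoted in this thread.
SIFTS_BASE_URL = "ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/"


def sifts_url(pdbid):
    """Build the SIFTS XML URL for a PDB ID (hypothetical helper)."""
    pdbid = pdbid.strip().lower()
    # Assumption: classic PDB IDs are exactly four alphanumeric characters.
    if len(pdbid) != 4 or not pdbid.isalnum():
        raise ValueError("expected a 4-character PDB ID, got %r" % pdbid)
    return SIFTS_BASE_URL + pdbid + ".xml.gz"


print(sifts_url("2SRC"))
# ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/2src.xml.gz
```

Printing the URL before fetching it makes it easy to paste the same address into `wget` or a browser and rule out a malformed path.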

Can you quickly see if your FTP access is blocked on those clusters with something like

wget ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/2src.xml.gz

jchodera commented 8 years ago

I was able to get past the SIFTS part on my desktop just now, though it oddly failed at a different point:

mski1776:ensembler-test choderaj$ ensembler quickmodel --target_uniprot_entry_name ABL1_HUMAN --uniprot_domain_regex '^Protein kinase' --template_pdbids 2SRC
WARNING: /opt/anaconda1anaconda2anaconda3 not found.
Ignoring mpi4py.
Using mpi4py on OS X with Anaconda currently requires that /opt/anaconda1anaconda2anaconda3 points to your Anaconda installation.
As a workaround, you can create a symlink, e.g. "sudo ln -s ~/anaconda /opt/anaconda1anaconda2anaconda3
Done.
Querying UniProt web server...
Number of entries returned from initial UniProt search: 2

Set of unique domain names returned from the initial UniProt search using the query string 'mnemonic:ABL1_HUMAN':
set(['SH2', 'SH3', 'Protein kinase'])

Unique domain names selected after searching with the case-sensitive regex string '^Protein kinase':
set(['Protein kinase'])

Done.
Downloading PDB file for: 2SRC
Downloading sifts file for: 2SRC
1 PDB chains selected.
Extracting residues from PDB chains...
1 templates selected.

Writing template structures...
Done.
Traceback (most recent call last):
  File "/Users/choderaj/miniconda/bin/ensembler", line 6, in <module>
    sys.exit(main())
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler-1.0.2-py2.7.egg/ensembler/cli.py", line 40, in main
    command.dispatch(args)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler-1.0.2-py2.7.egg/ensembler/cli_commands/quickmodel.py", line 106, in dispatch
    QuickModel(targetid=args['--targetid'], templateids=templateids, target_uniprot_entry_name=args['--target_uniprot_entry_name'], uniprot_domain_regex=args['--uniprot_domain_regex'], pdbids=pdbids, chainids=chainids_dict, template_uniprot_query=args['--template_uniprot_query'], template_seqid_cutoff=template_seqid_cutoff, loopmodel=not args['--no-loopmodel'], package_for_fah=args['--package_for_fah'], nfahclones=nfahclones, structure_dirs=structure_paths)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler-1.0.2-py2.7.egg/ensembler/tools/quick_model.py", line 72, in __init__
    templates_resolved_seq, templates_full_seq = ensembler.core.get_templates()
AttributeError: 'module' object has no attribute 'get_templates'

jchodera commented 8 years ago

My bad; I had two conflicting versions of ensembler installed.

Now I get something even more strange:

mski1776:ensembler-test choderaj$ ensembler quickmodel --target_uniprot_entry_name ABL1_HUMAN --uniprot_domain_regex '^Protein kinase' --template_pdbids 2SRC
WARNING: /opt/anaconda1anaconda2anaconda3 not found.
Ignoring mpi4py.
Using mpi4py on OS X with Anaconda currently requires that /opt/anaconda1anaconda2anaconda3 points to your Anaconda installation.
As a workaround, you can create a symlink, e.g. "sudo ln -s ~/anaconda /opt/anaconda1anaconda2anaconda3
Done.
Querying UniProt web server...
Number of entries returned from initial UniProt search: 2

Set of unique domain names returned from the initial UniProt search using the query string 'mnemonic:ABL1_HUMAN':
set(['SH2', 'SH3', 'Protein kinase'])

Unique domain names selected after searching with the case-sensitive regex string '^Protein kinase':
set(['Protein kinase'])

Done.
Downloading PDB file for: 2SRC
Downloading sifts file for: 2SRC
1 PDB chains selected.
Extracting residues from PDB chains...
1 templates selected.

Writing template structures...
Done.
MPI rank 0 pdbfixer error for template SRC_HUMAN_D0_2SRC_A - see logfile
Modeling missing loops for template SRC_HUMAN_D0_2SRC_A
Traceback (most recent call last):
  File "/Users/choderaj/miniconda/bin/ensembler", line 6, in <module>
    sys.exit(main())
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/cli.py", line 40, in main
    command.dispatch(args)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/cli_commands/quickmodel.py", line 106, in dispatch
    QuickModel(targetid=args['--targetid'], templateids=templateids, target_uniprot_entry_name=args['--target_uniprot_entry_name'], uniprot_domain_regex=args['--uniprot_domain_regex'], pdbids=pdbids, chainids=chainids_dict, template_uniprot_query=args['--template_uniprot_query'], template_seqid_cutoff=template_seqid_cutoff, loopmodel=not args['--no-loopmodel'], package_for_fah=args['--package_for_fah'], nfahclones=nfahclones, structure_dirs=structure_paths)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/tools/quick_model.py", line 95, in __init__
    self._model(self.targetid, self.templateids, loopmodel=self.loopmodel, package_for_fah=self.package_for_fah, nfahclones=self.nfahclones)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/tools/quick_model.py", line 138, in _model
    ensembler.modeling.model_template_loops(process_only_these_templates=templateids)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/utils.py", line 37, in print_done
    fn(*args, **kwargs)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/modeling.py", line 74, in model_template_loops
    loopmodel_templates(templates_resolved_seq, missing_residues_list, process_only_these_templates=process_only_these_templates, overwrite_structures=overwrite_structures)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/modeling.py", line 221, in loopmodel_templates
    loopmodel_template(template, missing_residues[template_index], overwrite_structures=overwrite_structures)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/modeling.py", line 234, in loopmodel_template
    write_loop_file(template, missing_residues)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/modeling.py", line 259, in write_loop_file
    loop_residues_data = [(key[1], len(residues)) for key, residues in missing_residues.iteritems()]
AttributeError: 'NoneType' object has no attribute 'iteritems'

I think we need to fix a bug in loop modeling: the step that builds in missing template residues fails when there are no missing residues. If we don't need loop modeling, though, I can supposedly just add the --no-loopmodel flag:

ensembler quickmodel --target_uniprot_entry_name ABL1_HUMAN --uniprot_domain_regex '^Protein kinase' --template_pdbids 2SRC --no-loopmodel
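
The `AttributeError` in the traceback above comes from calling `.iteritems()` on `None`. One possible guard is sketched below. It is only an illustration of the idea, not the fix that went into ensembler: the function name `loop_residues_data_for` and the sample dictionary are hypothetical, and the list comprehension simply mirrors the line shown in the traceback.

```python
def loop_residues_data_for(missing_residues):
    """Return (key[1], count) pairs, tolerating a missing mapping.

    `missing_residues` is assumed to be either None (nothing recorded for
    this template) or a dict whose keys are tuples, as implied by the
    `key[1]` indexing in the original line.
    """
    if not missing_residues:
        # None or empty dict: there are no loops to model.
        return []
    return [(key[1], len(residues)) for key, residues in missing_residues.items()]


# Hypothetical data, for illustration only.
print(loop_residues_data_for(None))                          # []
print(loop_residues_data_for({("A", 5): ["GLY", "SER"]}))    # [(5, 2)]
```

Using `.items()` rather than `.iteritems()` keeps the sketch valid on both Python 2 and Python 3.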

jchodera commented 8 years ago

This successfully gets through the MODELLER stage, but then dies at the explicit solvent stage:

mski1776:ensembler-test choderaj$ ensembler quickmodel --target_uniprot_entry_name ABL1_HUMAN --uniprot_domain_regex '^Protein kinase' --template_pdbids 2SRC --no-loopmodel
WARNING: /opt/anaconda1anaconda2anaconda3 not found.
Ignoring mpi4py.
Using mpi4py on OS X with Anaconda currently requires that /opt/anaconda1anaconda2anaconda3 points to your Anaconda installation.
As a workaround, you can create a symlink, e.g. "sudo ln -s ~/anaconda /opt/anaconda1anaconda2anaconda3
Done.
Querying UniProt web server...
Number of entries returned from initial UniProt search: 2

Set of unique domain names returned from the initial UniProt search using the query string 'mnemonic:ABL1_HUMAN':
set(['SH2', 'SH3', 'Protein kinase'])

Unique domain names selected after searching with the case-sensitive regex string '^Protein kinase':
set(['Protein kinase'])

Done.
Downloading PDB file for: 2SRC
Downloading sifts file for: 2SRC
1 PDB chains selected.
Extracting residues from PDB chains...
1 templates selected.

Writing template structures...
Done.
Working on target ABL1_HUMAN_D0...
Done.
=========================================================================
Working on target "ABL1_HUMAN_D0"
=========================================================================
-------------------------------------------------------------------------
Modelling "ABL1_HUMAN_D0" => "SRC_HUMAN_D0_2SRC_A"
-------------------------------------------------------------------------
The following 1 residues contain 6-membered rings with poor geometries
after transfer from templates. Rebuilding rings from internal coordinates:
   <Residue 228 (type TYR)>
0 atoms in HETATM/BLK residues constrained
to protein atoms within 2.30 angstroms
and protein CA atoms within 10.00 angstroms
0 atoms in residues without defined topology
constrained to be rigid bodies

>> Summary of successfully produced models:
Filename                          molpdf
----------------------------------------
ABL1_HUMAN_D0.B99990001.pdb    50007.05859

Done.
Constructing a trajectory containing all valid models...
Conducting RMSD-based clustering...
1 unique models (from original set of 1) using cutoff of 0.060 nm
Done.
Auto-selected OpenMM platform: OpenCL
-------------------------------------------------------------------------
Simulating ABL1_HUMAN_D0 => SRC_HUMAN_D0_2SRC_A in implicit solvent for 100.0 ps (MPI rank: 0, GPU ID: 0)
-------------------------------------------------------------------------
/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/refinement.py:300: UserWarning: = ERROR start: MPI rank 0 hostname mski1776 gpuid 0 =
No compatible OpenCL platform is available
Traceback (most recent call last):
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 288, in refine_implicit_md
    simulate_implicit_md()
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 122, in simulate_implicit_md
    context = openmm.Context(system, integrator, platform, platform_properties)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 6469, in __init__
    this = _openmm.new_Context(*args)
Exception: No compatible OpenCL platform is available

= ERROR end: MPI rank 0 hostname mski1776 gpuid 0
  mpistate.rank, socket.gethostname(), gpuid, e, trbk
Done.
Done.
No nwaters information found.
Done.
Auto-selected OpenMM platform: OpenCL
Traceback (most recent call last):
  File "/Users/choderaj/miniconda/bin/ensembler", line 6, in <module>
    sys.exit(main())
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/cli.py", line 40, in main
    command.dispatch(args)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/cli_commands/quickmodel.py", line 106, in dispatch
    QuickModel(targetid=args['--targetid'], templateids=templateids, target_uniprot_entry_name=args['--target_uniprot_entry_name'], uniprot_domain_regex=args['--uniprot_domain_regex'], pdbids=pdbids, chainids=chainids_dict, template_uniprot_query=args['--template_uniprot_query'], template_seqid_cutoff=template_seqid_cutoff, loopmodel=not args['--no-loopmodel'], package_for_fah=args['--package_for_fah'], nfahclones=nfahclones, structure_dirs=structure_paths)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/tools/quick_model.py", line 95, in __init__
    self._model(self.targetid, self.templateids, loopmodel=self.loopmodel, package_for_fah=self.package_for_fah, nfahclones=self.nfahclones)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/tools/quick_model.py", line 145, in _model
    ensembler.refinement.refine_explicit_md(process_only_these_targets=targetid, process_only_these_templates=templateids, sim_length=self.sim_length)
  File "/Users/choderaj/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 1008, in refine_explicit_md
    with open(nwaters_filename, 'r') as infile:
IOError: [Errno 2] No such file or directory: '/Users/choderaj/Desktop/ensembler-test/models/ABL1_HUMAN_D0/nwaters-use.txt'

@sonyahanson: Any ideas here? Is this a real bug?

msultan commented 8 years ago

Alright, so it has now started working again. I am guessing it was the FTP issue. I was playing around with the commands a bit, and I suspect the SIFTS server blocked both clusters because of the large number of requests. Though I only had 150-odd templates, so it shouldn't really have caused an issue.

ensembler quickmodel --target_uniprot_entry_name ABL1_HUMAN --uniprot_domain_regex '^Protein kinase' --template_pdbids 2SRC
Querying UniProt web server...
Number of entries returned from initial UniProt search: 2

Set of unique domain names returned from the initial UniProt search using the query string 'mnemonic:ABL1_HUMAN':
set(['SH2', 'SH3', 'Protein kinase'])

Unique domain names selected after searching with the case-sensitive regex string '^Protein kinase':
set(['Protein kinase'])

Done.
1 PDB chains selected.
Extracting residues from PDB chains...
1 templates selected.

Writing template structures...
Done.
Modeling missing loops for template SRC_HUMAN_D0_2SRC_A
Done.
Working on target ABL1_HUMAN_D0...
Done.
=========================================================================
Working on target "ABL1_HUMAN_D0"
=========================================================================
-------------------------------------------------------------------------
Modelling "ABL1_HUMAN_D0" => "SRC_HUMAN_D0_2SRC_A"
-------------------------------------------------------------------------
...
Done.
Constructing a trajectory containing all valid models...
Conducting RMSD-based clustering...
1 unique models (from original set of 1) using cutoff of 0.060 nm
Done.
Auto-selected OpenMM platform: OpenCL

btw is there any way to limit templates to those within a certain resolution cutoff?

jchodera commented 8 years ago

By resolution, do you mean crystallographic resolution, allowing you to avoid low-resolution crystallographic structures as templates?

I don't believe we have that capability yet, but it's a great idea that should be relatively easy to implement. Can you add a separate feature request, if this is what you intended?
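
A rough sketch of what such a filter could look like is given below. It is not an ensembler feature: the function names, the `templates` dictionary, and the sample records are all hypothetical. The only outside fact it relies on is that PDB-format files report resolution in a `REMARK   2 RESOLUTION.` record.

```python
import re

# Matches e.g. "REMARK   2 RESOLUTION.    1.50 ANGSTROMS."
_RESOLUTION_RE = re.compile(
    r"^REMARK\s+2\s+RESOLUTION\.\s+([0-9]*\.?[0-9]+)\s+ANGSTROMS"
)


def parse_resolution(pdb_text):
    """Return the resolution in angstroms, or None if it is not reported."""
    for line in pdb_text.splitlines():
        match = _RESOLUTION_RE.match(line)
        if match:
            return float(match.group(1))
    return None  # e.g. NMR entries report "NOT APPLICABLE"


def filter_by_resolution(templates, cutoff):
    """Keep template IDs whose resolution is at or below `cutoff` angstroms.

    `templates` maps a template ID to the text of its PDB file.
    Templates with no reported resolution are dropped.
    """
    kept = []
    for template_id, pdb_text in sorted(templates.items()):
        resolution = parse_resolution(pdb_text)
        if resolution is not None and resolution <= cutoff:
            kept.append(template_id)
    return kept


# Made-up records for illustration only.
templates = {
    "T1": "REMARK   2 RESOLUTION.    1.50 ANGSTROMS.",
    "T2": "REMARK   2 RESOLUTION.    3.20 ANGSTROMS.",
    "T3": "REMARK   2 RESOLUTION. NOT APPLICABLE.",
}
print(filter_by_resolution(templates, cutoff=2.5))  # ['T1']
```

Dropping templates without a reported resolution is a design choice in this sketch; a real implementation might prefer to keep them and warn instead.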

msultan commented 8 years ago

Yeah, that's what I meant. I will open the FR for it.

Out of curiosity, what information does the SIFTS file add that is not available in the RCSB databank? I have done a similar pipeline on a smaller scale with a set of scripts, and I thought PDB + alignment + MODELLER was all that was needed.

jchodera commented 8 years ago

Good question. I think SIFTS provides a nice machine-readable annotation that contains pointers to numerous other databases. I dimly recall @danielparton noting that it has useful cross-references to canonical residue numbering schemes in UniProt, which we use for referencing which domains or sequence subsets we are modeling. This may not be useful for individual one-template, one-target quickmodel runs, but it is essential when working at the superfamily scale.

jchodera commented 8 years ago

Go ahead and close this issue if your problems are solved?

msultan commented 8 years ago

Thanks!

danielparton commented 8 years ago

Yep, SIFTS has residue-level mappings between PDB, UniProt and other databases.
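
To make the residue-level mapping concrete, here is a small sketch that reads PDB-to-UniProt residue numbers out of a SIFTS-style XML document. The embedded XML is a made-up, heavily simplified fragment (hypothetical entry `1abc`, hypothetical accession `P00000`). The element and attribute names (`residue`, `crossRefDb`, `dbSource`, `dbResNum`, `dbChainId`, `dbAccessionId`) are assumptions about the SIFTS layout and should be checked against a real file. This is not the parser ensembler uses.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like a SIFTS file; values are illustrative only.
SAMPLE_XML = """\
<entry xmlns="http://www.ebi.ac.uk/pdbe/docs/sifts/eFamily.xsd" dbAccessionId="1abc">
  <entity type="protein" entityId="A">
    <segment segId="1abc_A_1_3">
      <listResidue>
        <residue dbSource="PDBe" dbResNum="1" dbResName="MET">
          <crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="10" dbChainId="A"/>
          <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="12"/>
        </residue>
        <residue dbSource="PDBe" dbResNum="2" dbResName="GLY">
          <crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="11" dbChainId="A"/>
          <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="13"/>
        </residue>
        <residue dbSource="PDBe" dbResNum="3" dbResName="HIS">
          <crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="12" dbChainId="A"/>
        </residue>
      </listResidue>
    </segment>
  </entity>
</entry>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pdb_to_uniprot(xml_text):
    """Map (chain ID, PDB residue number) -> (UniProt accession, residue number).

    PDB residue numbers stay as strings because they may carry insertion codes.
    Residues without a UniProt cross-reference are skipped.
    """
    mapping = {}
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "residue":
            continue
        refs = {
            child.get("dbSource"): child
            for child in element
            if local_name(child.tag) == "crossRefDb"
        }
        pdb_ref, uniprot_ref = refs.get("PDB"), refs.get("UniProt")
        if pdb_ref is None or uniprot_ref is None:
            continue
        key = (pdb_ref.get("dbChainId"), pdb_ref.get("dbResNum"))
        mapping[key] = (uniprot_ref.get("dbAccessionId"), int(uniprot_ref.get("dbResNum")))
    return mapping


mapping = pdb_to_uniprot(SAMPLE_XML)
print(mapping[("A", "10")])  # ('P00000', 12)
```

A mapping of this kind is what lets a pipeline refer to a domain by UniProt numbering and still pick the right residues out of each PDB chain.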

The SIFTS server is often a bit flaky... I sometimes have to repeat a command to get a SIFTS file to download.
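
Repeating the download by hand, as described above, could be automated with a small retry wrapper. The sketch below is an assumption-laden illustration rather than anything ensembler provides: the `retry` helper, its parameters, and the simulated flaky download are all hypothetical.

```python
import time


def retry(fetch, attempts=3, delay=1.0, backoff=2.0,
          exceptions=(IOError,), sleep=time.sleep):
    """Call `fetch()` until it succeeds or `attempts` tries are used up.

    The wait between tries starts at `delay` seconds and is multiplied by
    `backoff` after each failure. URLError derives from IOError on Python 2
    and from OSError (an alias of IOError) on Python 3, so the default
    `exceptions` covers FTP failures like the one in this issue.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except exceptions:
            if attempt == attempts:
                raise  # out of tries: surface the last error
            sleep(delay)
            delay *= backoff


# Simulated flaky download: fails twice, then succeeds.
calls = []

def flaky_download():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("ftp error: 550 Failed to change directory.")
    return "sifts-xml-bytes"

print(retry(flaky_download, attempts=3, sleep=lambda seconds: None))
# sifts-xml-bytes
```

In real use, `fetch` would wrap the actual download call, and the default `time.sleep` would space the attempts out so that a struggling server is not hit repeatedly in quick succession.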