choderalab / ensembler

Automated omics-scale protein modeling and simulation setup.
http://ensembler.readthedocs.io/
GNU General Public License v2.0
52 stars 21 forks source link

Proposing improvements to quickmodel command line arguments #70

Closed rafwiewiora closed 8 years ago

rafwiewiora commented 8 years ago

@jchodera

Pasting from Slack:

About the code question - what I meant is, in words what I want Ensembler to do here is - look at Uniprot protein code Q9NQR1 - take the sequence of that protein and use that as target & take all PDB structures listed for that protein and use them as templates; align and model and much sequence as is possible with these inputs

[12:43] so the only user input I made here is really the Uniprot code

[12:43] which is why I’m a bit surprised I have to write it out as this ensembler quickmodel --target_uniprot_entry_name "SETD8_HUMAN" --template_uniprot_query "accession:q9nqr1" --no-loopmodel --uniprot_domain_regex "SET"

john.chodera [12:44 PM] If you try to model a multidomain protein with ensembler it will likely give you garbage

[12:44] Are you asking if we could just make --uniprot_domain_regex <regex> optional if you want the whole thing?

[12:44] And maybe --template_uniprot_query should also default to the target_uniprot_entry_name first variant?

[12:45] Those sound like reasonable behaviors.

[12:45] Maybe open an issue to suggest these, and we can see how feasible that is?

rafwiewiora [12:49 PM] The domain regex is fine - yeah, I get the potential problems - no problems here even though the Uniprot sequence is not limited to SET domain, but all but 1 templates are only the SET domain - so I think within the full pipeline i managed to not specify the domain and Modeller got it right - but I guess the problems would arise if the templates have different sequence coverages. So that’s fine!

But yeah the —template_uniprot_query' and—target_uniprot_entry_nameare specifying the same thing, but in 2 different ways - that is very confusing I think, so I would suggest: a) Allowing using the same thing for both + default same would be nice b) I was also confused about the other arguments—targetidsandtemplateids, they sounds like the Uniprot code should be passed, but in fact they’re names of folders containing targets and templates in the project folder if they exist already right? So that had me trapped for a moment - first thing I’m going to write more docs for. Should we just have a ‘target_and_template_uniprot_code or something like that?

jchodera commented 8 years ago

Can you reformat this as a short, succinct statement of the suggested improvement(s)?

rafwiewiora commented 8 years ago

Of course, sorry!

Action I wanted: Look at Uniprot code: Q9NQR1 and model the SET domain using all PDB structures available under that code.

Required input:

ensembler quickmodel --target_uniprot_entry_name "SETD8_HUMAN" --template_uniprot_query "accession:q9nqr1" --no-loopmodel --uniprot_domain_regex "SET"

Redundancies: --target_uniprot_entry_name "SETD8_HUMAN" and --template_uniprot_query "accession:q9nqr1" - same thing specified twice, in two different forms - also no availability to just pass the Q9NQR1 code - why have to generate the accession:q9nqr1 string (I never remember the word is accession...)

Proposal: replace the redundancy with a single, just Uniprot code containing, argument:

--target_template_uniprot_code Q9NQR1

so the full call would be for me:

ensembler quickmodel --target_template_uniprot_code Q9NQR1--no-loopmodel --uniprot_domain_regex "SET"

Further proposal would be to get rid of having to mention the domain, and just do 'as much modeling on the given sequence from Uniprot code as possible' - i.e. just cut the sequence to the biggest coverage from PDB available. This must be done somewhere internally already or Modeller does it - but perhaps this might lead to problems because I've only tried one use case here.

danielparton commented 8 years ago

Quickmodel is designed to model a single target protein with a small number of template structures. But the template protein doesn't have to be the same as the target protein. Hence there are separate flags to define target and template. If you want to use the same protein as target and template, then there will be redundancy in the command. To avoid this, you could implement the following in the CLI (< https://github.com/choderalab/ensembler/blob/master/ensembler/cli_commands/quickmodel.py

):

  • If no template-defining flags are entered (e.g. --template_uniprot_query), have quickmodel default to using the protein defined with --target_uniprot_entry_name

Regarding accession codes, I was trying to allow the user to use UniProt mnemonics (e.g. "SETD8_HUMAN") where possible, rather than accession code (e.g. Q9NQR1), since it is more human-readable. Both are unique identifiers in UniProt. If you definitely want to be able to use accession code, you could perhaps add a --target_uniprot_ac flag (where ac means accession code).

The docstring for this command (ensembler quickmodel -h) does contain examples for each of the flags. But there are a lot of available flags, so could be good to add further documentation, e.g. commented examples for each use case.

On Mon, Apr 4, 2016 at 2:19 PM, Rafal Wiewiora notifications@github.com wrote:

Of course, sorry!

Action I wanted: Look at Uniprot code: Q9NQR1 and model the SET domain using all PDB structures available under that code.

Required input:

ensembler quickmodel --target_uniprot_entry_name "SETD8_HUMAN" --template_uniprot_query "accession:q9nqr1" --no-loopmodel --uniprot_domain_regex "SET"

Redundancies: --target_uniprot_entry_name "SETD8_HUMAN" and --template_uniprot_query "accession:q9nqr1" - same thing specified twice, in two different forms - also no availability to just pass the Q9NQR1 code - why have to generate the accession:q9nqr1 string (I never remember the word is accession...)

Proposal: replace the redundancy with a single, just Uniprot code containing, argument:

--target_template_uniprot_code Q9NQR1

so the full call would be for me:

ensembler quickmodel --target_template_uniprot_code Q9NQR1--no-loopmodel --uniprot_domain_regex "SET"

Further proposal would be to get rid of having to mention the domain, and just do 'as much modeling on the given sequence from Uniprot code as possible' - i.e. just cut the sequence to the biggest coverage from PDB available. This must be done somewhere internally already or Modeller does it - but perhaps this might lead to problems because I've only tried one use case here.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/choderalab/ensembler/issues/70#issuecomment-205430855

rafwiewiora commented 8 years ago

Thanks @danielparton !

I think what actually bothered me more was that the target specification and template specifications are done in different ways (name vs search query) - so I think a great solution might be to just add a --template_uniprot_entry_name' argument so you can call: ensembler quickmodel --target_uniprot_entry_name SETD8_HUMAN --template_uniprot_entry_name SETD8_HUMAN`

But aha! I can just pass SETD8_HUMAN as --template_uniprot_query.

Same with the code - I can see the reasoning now!

I think the only thing missing is just better docs - because you only wrote one comment as it's much clearer to me now! So I'll write more docstrings as I explore the code more!

danielparton commented 8 years ago

Sure thing. Yeah, documentation could definitely be improved. Probably good for someone else to give that try - I probably know the code too well... But feel free to ask if you have questions

On Mon, Apr 4, 2016 at 4:23 PM, Rafal Wiewiora notifications@github.com wrote:

Thanks @danielparton https://github.com/danielparton !

I think what actually bothered me more was that the target specification and template specifications are done in different ways (name vs search query) - so I think a great solution might be to just add a --template_uniprot_entry_name' argument so you can call: ensembler quickmodel --target_uniprot_entry_name SETD8_HUMAN --template_uniprot_entry_name SETD8_HUMAN`

But aha! I can just pass SETD8_HUMAN as --template_uniprot_query.

Same with the code - I can see the reasoning now!

I think the only thing missing is just better docs - because you only wrote one comment as it's much clearer to me now! So I'll write more docstrings as I explore the code more!

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/choderalab/ensembler/issues/70#issuecomment-205478726

rafwiewiora commented 8 years ago

Ok cool, will do! Thanks Danny!