fhcrc / seqmagick

An imagemagick-like frontend to Biopython SeqIO
http://seqmagick.readthedocs.org
GNU General Public License v3.0

Protparam #41

Closed omgwtfgames closed 4 years ago

omgwtfgames commented 10 years ago

This patch adds a "protparam" command for calculating molecular weight and predicted isoelectric point for a set of protein sequences, and for sorting the sequences by those values.

I actually wrote most of this code ~2 years ago for my own use, but forgot to clean it up and send a pull request :P. If this is a feature you'd like to integrate, I can also add some documentation.

(err .. bugger .. sent this pull request while logged into the wrong Github account too ...)

matsen commented 10 years ago

Totally cool, and thanks!

You say

Calculate molecular weight, theoretical isoelectric point and other physiochemical properties of protein sequences.

Am I mistaken, or does it only calculate the first two?

Also, does it make sense to continue adding to weight, etc, after a stop codon?

omgwtfgames commented 10 years ago

You are not mistaken - the docstring was out of sync with where the code ended up. The Bio.SeqUtils.ProtParam module that it's based on will calculate other properties and scores (similar to the ExPASy tool of the same name). Initially I intended to add options to output these other scores but ultimately decided to only implement MW and pI. I've fixed the docstring in the branch now.

Also, does it make sense to continue adding to weight, etc, after a stop codon?

Ah, the age-old question of what to do in the face of any non-standard variation in a protein sequence file :) I'm not sure I've ever seen a stop "*" in the middle of a protein sequence, but anything is possible. I think the best solution is to run in strict validation mode by default, where execution stops and an error is reported if any characters outside the standard 20 uppercase amino acids are present (but allowing a * to be trimmed from the end). If gaps or lowercase characters are present, "convert --ungap --upper" can be run first.
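A minimal sketch of the strict validation mode described above, not the code from the pull request. The names `STANDARD_AA`, `InvalidProteinSequence` and `validate_protein` are hypothetical, and it assumes the trailing character to trim is the `*` stop symbol:

```python
# The 20 standard amino acids, uppercase one-letter codes.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


class InvalidProteinSequence(ValueError):
    """Raised when a sequence contains characters outside the standard 20."""


def validate_protein(seq):
    """Return seq with a single trailing '*' trimmed.

    Raises InvalidProteinSequence if any remaining character is not one of
    the 20 standard uppercase residues (gaps, lowercase, internal '*', ...).
    """
    trimmed = seq[:-1] if seq.endswith("*") else seq
    bad = sorted(set(trimmed) - STANDARD_AA)
    if bad:
        raise InvalidProteinSequence(
            "non-standard characters: " + "".join(bad)
        )
    return trimmed
```

Under this sketch, a sequence with gaps or lowercase residues is rejected rather than silently cleaned, which matches the suggestion to run "convert --ungap --upper" first.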

Upon reviewing this further, I also found that my own molecular weight function that explicitly avoids accounting for waters (while useful for my purpose at the time) is probably the wrong one to include when actually calculating masses under the banner of "protparam". Bio.SeqUtils.ProtParam.molecular_weight() gives masses that are (almost) the same as ExPASy ProtParam, so I'm now deferring to Biopython for this.
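To illustrate the point about waters: summing residue masses alone gives the mass of the chain without its terminal H and OH, so a free peptide weighs one water more. The sketch below is illustrative only and is not Biopython's implementation or the code from the pull request. The function name, the `with_water` flag and the rounded average residue masses are assumptions made for the example:

```python
# Approximate average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # average mass of H2O in Da


def molecular_weight(seq, with_water=True):
    """Sum average residue masses; optionally add one water for the termini."""
    if not seq:
        return 0.0
    total = sum(RESIDUE_MASS[aa] for aa in seq)  # KeyError on unknown residues
    return total + WATER_MASS if with_water else total
```

With `with_water=False` the result corresponds to the water-free sum mentioned above. The default adds the single water, which is the convention a ProtParam-style mass would be expected to follow.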

I've added some docs. I should also add tests - I might have to restructure things into a class to do this.

matsen commented 10 years ago

I agree. Thanks for your work on this!

On Tue, May 20, 2014 at 6:35 AM, Andrew Perry wrote:

(quoted copy of the preceding comment omitted)

Frederick "Erick" Matsen, Assistant Member Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org/

metasoarous commented 6 years ago

@omgwtfgames @matsen Is this work ready to merge as far as you're concerned?

@matsen If you like, I'm happy to do review and/or merge on this while my attention is on seqmagick.

omgwtfgames commented 6 years ago

Yep, good to merge from my end.

tillea commented 4 years ago

Any progress on accepting this pull request?

matsen commented 4 years ago

Boy, we did drop the ball here.

You really do want this? It seems a bit at odds with the rest of the utilities in the package.

tillea commented 4 years ago

Boy, we did drop the ball here. You really do want this? It seems a bit at odds with the rest of the utilities in the package.

I just maintain the Debian package of seqmagick and realised that there is a pending user request that might or might not be interesting for Debian users. I personally do not want this and do not have any opinion on it.

matsen commented 4 years ago

There's a debian package???

OK, I'll close this. I don't think it's of general interest.

tillea commented 4 years ago

On Wed, Jul 08, 2020 at 02:30:09AM -0700, Erick Matsen wrote:

There's a debian package???

The Debian Med team is packaging lots of bioinformatics stuff:

 https://blends.debian.org/med/tasks/bio#seqmagick

You might like to check out more. Feel free to mention it in your install instructions. Maybe we need to advertise our work a bit better ...