Arcadia-Science / ProteinCartography

a pipeline to build similarity maps of protein space
MIT License

Add a parameter so the user can require hits to be a certain % of the reference protein's length to be included in results #34

Closed elizabethmcd closed 1 year ago

elizabethmcd commented 1 year ago

Some hits come back with a very high TM-score but are often less than 50% of the length of the reference protein. For example, a full-length reference query can have good TM-scores to proteins that are maybe 20% of its length, because those hits are partial clone sequences and not full-length proteins. A good way around this is to add an option where the user defines what percent of the reference query's length the hits must be (like 70%), apply that filter prior to clustering, and only plot the results for proteins that pass it. This will help sift out hits that might technically be good but are essentially trash.
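To make the request concrete, here is a minimal sketch of the proposed filter. It is not code from the pipeline: the function name `filter_hits_by_length_fraction`, the dict keys, and the example data are all made up for illustration, and it assumes each hit's sequence length is already known.

```python
from typing import Iterable


def filter_hits_by_length_fraction(
    hits: Iterable[dict],
    reference_length: int,
    min_fraction: float = 0.7,
) -> list[dict]:
    """Keep hits whose length is at least min_fraction of the reference length.

    Hypothetical helper: each hit is assumed to be a dict with a "length" key
    holding the number of amino acids.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    if not 0.0 <= min_fraction <= 1.0:
        raise ValueError("min_fraction must be between 0 and 1")
    threshold = min_fraction * reference_length
    return [hit for hit in hits if hit["length"] >= threshold]


# Made-up example: a 400-residue reference and three hits with good TM-scores.
hits = [
    {"id": "full_length", "length": 390, "tm_score": 0.91},
    {"id": "partial_clone", "length": 80, "tm_score": 0.88},
    {"id": "borderline", "length": 290, "tm_score": 0.75},
]
kept = filter_hits_by_length_fraction(hits, reference_length=400, min_fraction=0.7)
print([hit["id"] for hit in kept])  # ['full_length', 'borderline']
```

Running this before clustering would mean the partial clone never reaches the plots, regardless of its TM-score.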

mezarque commented 1 year ago

This sounds great!

I think there should probably be other filtering in addition to the things you mentioned. UniProt theoretically has a fragment field that can be returned, although the pipeline doesn't seem to retrieve that information yet.

I will look into this and add some filtering options based on minimum protein length (or fraction of input protein), protein fragment information, and also other things like sequence version (the pipeline currently also retrieves out-of-date or not-maintained proteins). By default the pipeline will probably filter out out-of-date and fragmentary proteins, and the user will be able to decide whether to be more stringent or to relax from there.
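As a rough illustration of the defaults described above (drop fragmentary and out-of-date entries, with optional stricter length filtering), the logic might look like the sketch below. The `ProteinRecord` fields and the `passes_default_filters` function are hypothetical stand-ins, not the pipeline's actual metadata columns or API, and the accessions are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProteinRecord:
    # Hypothetical fields; the pipeline's real metadata may be named differently.
    accession: str
    length: int          # number of amino acids
    is_fragment: bool    # e.g. derived from a fragment annotation
    is_obsolete: bool    # e.g. an out-of-date or unmaintained entry


def passes_default_filters(
    record: ProteinRecord,
    reference_length: int,
    min_fraction: Optional[float] = None,
    drop_fragments: bool = True,
    drop_obsolete: bool = True,
) -> bool:
    """Return True if the record survives the assumed default filters."""
    if drop_fragments and record.is_fragment:
        return False
    if drop_obsolete and record.is_obsolete:
        return False
    # The length check is only applied when the user asks for it.
    if min_fraction is not None and record.length < min_fraction * reference_length:
        return False
    return True


# Made-up records compared against a 400-residue input protein.
records = [
    ProteinRecord("A_FULL", 395, is_fragment=False, is_obsolete=False),
    ProteinRecord("B_FRAG", 120, is_fragment=True, is_obsolete=False),
    ProteinRecord("C_OLD", 400, is_fragment=False, is_obsolete=True),
    ProteinRecord("D_SHORT", 150, is_fragment=False, is_obsolete=False),
]
default_kept = [r.accession for r in records if passes_default_filters(r, 400)]
strict_kept = [
    r.accession for r in records if passes_default_filters(r, 400, min_fraction=0.7)
]
print(default_kept)  # ['A_FULL', 'D_SHORT']
print(strict_kept)   # ['A_FULL']
```

Keeping each check behind its own flag is one way to let the user tighten or relax the defaults independently.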

mezarque commented 1 year ago

A version of this is included as part of #39 - you can specify min_length and max_length parameters in the config YAML, each of which is an integer number of amino acids.
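Since min_length and max_length are absolute residue counts rather than percentages, a percent-of-reference threshold has to be converted by hand. The helper below is a small sketch of that arithmetic; `length_bounds_from_percent` is a hypothetical name, and whether the pipeline treats the bounds as inclusive is an assumption to check against its documentation.

```python
from typing import Optional, Tuple


def length_bounds_from_percent(
    reference_length: int,
    min_percent: int,
    max_percent: Optional[int] = None,
) -> Tuple[int, Optional[int]]:
    """Convert percent-of-reference thresholds into integer residue counts.

    Hypothetical helper. Integer arithmetic is used so the result does not
    depend on floating-point rounding.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    if min_percent < 0:
        raise ValueError("min_percent must be non-negative")
    # Round the minimum up so a hit is never shorter than the requested percent.
    min_length = (reference_length * min_percent + 99) // 100
    # Round the maximum down so a hit is never longer than the requested percent.
    max_length = None
    if max_percent is not None:
        max_length = reference_length * max_percent // 100
    return min_length, max_length


# Made-up example: hits must be 70% to 150% of a 400-residue reference.
min_length, max_length = length_bounds_from_percent(400, 70, 150)
print(min_length, max_length)  # 280 600
```

The two resulting integers are the values one would then enter for min_length and max_length in the config.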