UCLOrengoGroup / cath-tools

Protein structure comparison tools such as SSAP and SNAP
http://cath-tools.readthedocs.io
GNU General Public License v3.0

CRH: suggested range for '--long-domains-preference'? #71

Closed sillitoe closed 4 years ago

sillitoe commented 4 years ago

Relaying a question from an email - Harry wants to encourage longer domains in the final resolved domain boundaries (reason: the superfamily in question has discontinuous domains and we want to avoid the possibility of getting individual "segments", rather than the full domain).

I tried running cath-resolve-hits with --long-domains-preference set to 2 and 3.5 and it does seem to improve things, but I don't really know what sort of value it should be set to.

I couldn't find any documentation on recommended ranges to use with this param - any ideas?

jonglees commented 4 years ago

Just to check: are you using the flag --min-dc-hmm-coverage=80, as suggested in ftp://orengoftp.biochem.ucl.ac.uk/gene3d/CURRENT_RELEASE/gene3d_hmmsearch/README_scan.txt?

(Sent by email in reply to Ian Sillitoe's message of Mon, Dec 9, 2019, 4:38 PM.)

sillitoe commented 4 years ago

Thanks Jon - not sure whether the models in this library are explicitly named as dc_*.

@harryscholes?

harryscholes commented 4 years ago

Hi Jon, I do use --min-dc-hmm-coverage=80 for resolving hmmsearch outputs of Gene3D S95 models using cath-resolve-hits. Should I also use this option for resolving FunFam hits? I am finding a possible problem, where I am getting a lot of very short FunFam matches that are much shorter than the FunFam HMM lengths.

harryscholes commented 4 years ago

image

harryscholes commented 4 years ago

Results with:

cath-resolve-hits --input-format hmmsearch_out --min-dc-hmm-coverage=80 --worst-permissible-bitscore 25

image

I would say that this has worked. Thanks @jonglees.

@sillitoe my code was inspired by https://github.com/UCLOrengoGroup/cath-tools-genomescan. Do you think the following lines should be updated to include these cath-resolve-hits filters?

https://github.com/UCLOrengoGroup/cath-tools-genomescan/blob/bd20da37e240f794397c087841484b73cf16fb0a/apps/cath-genomescan.pl#L129-L139

tonyelewis commented 4 years ago

Sorry for the slow response.

Are y'all happy with where this has ended up? Please shout if not.

From what I can see in the code, --long-domains-preference x causes the score of a hit to be multiplied by the length of the hit to the power of x. So: the default value of 0 means all hits are multiplied by 1 (i.e. remain equally preferred relative to each other); 1 would mean multiplying every hit's score by its length; -1 would mean dividing every hit's score by its length.

(Detail: actually each length is divided by 400 before raising it to the power, so as to prevent the numbers from getting very silly very quickly, but that doesn't affect how much one hit is preferred relative to another.)

But this is just a knob that's available for you to twiddle to give a bit of control; if the details of the numbers are important to you, you should probably be generating your own scores.
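As a rough illustration of the scaling described above, here is a minimal Python sketch. It is not taken from the cath-tools source: the function name, the example hits and their scores are all hypothetical, and the formula is simply the one stated in this comment (score multiplied by (length / 400) raised to the power x).

```python
def length_adjusted_score(score, length, preference, scale=400.0):
    """Multiply a hit's score by (length / scale) ** preference.

    Assumption: this mirrors the behaviour described in the comment above;
    it is not copied from the cath-resolve-hits implementation.
    """
    return score * (length / scale) ** preference


# Hypothetical hits: one full-length domain and one short segment.
full_domain = {"score": 50.0, "length": 200}
segment = {"score": 30.0, "length": 60}

for preference in (0, 1, 2, 3.5):
    full = length_adjusted_score(full_domain["score"], full_domain["length"], preference)
    seg = length_adjusted_score(segment["score"], segment["length"], preference)
    # The ratio depends only on the two scores, the two lengths and the
    # preference; the constant 400 cancels out.
    print(f"preference={preference}: full/segment score ratio = {full / seg:.2f}")
```

Under these assumptions, raising the preference increases the ratio in favour of the longer hit, which is consistent with larger values encouraging full domains over individual segments.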

harryscholes commented 4 years ago

Thanks for the clear explanation Tony! I tried running crh with the following options:

cath-resolve-hits \
    --input-format hmmsearch_out \
    --min-dc-hmm-coverage 80 \
    --worst-permissible-bitscore 25 \
    --long-domains-preference 2   # repeated with 3.5

Resulting in the following distribution of match lengths, whether I used 2 or 3.5:

image

I'm happy with this, so I think the issue can be closed. Thanks all!
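For anyone wanting to repeat the comparison, a small Python sketch that builds one command line per preference value might look like the following. The option names are the ones used in this thread; the helper name, the input file name, and passing the input file as a trailing positional argument are assumptions, and the commands are only printed, not run.

```python
import shlex


def build_crh_command(input_file, long_domains_preference):
    """Build a cath-resolve-hits argument list with the options from this thread."""
    return [
        "cath-resolve-hits",
        "--input-format", "hmmsearch_out",
        "--min-dc-hmm-coverage", "80",
        "--worst-permissible-bitscore", "25",
        "--long-domains-preference", str(long_domains_preference),
        input_file,  # assumption: the input file is given as a positional argument
    ]


# "hits.hmmsearch_out" is a hypothetical file name.
for preference in (2, 3.5):
    print(shlex.join(build_crh_command("hits.hmmsearch_out", preference)))
```

The printed lines can be checked by eye before running them, or the argument lists passed to subprocess.run once the input path is real.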

sillitoe commented 4 years ago

Really clear (and useful) results - thanks all.