Closed sillitoe closed 4 years ago
Just to check: are you using the flag --min-dc-hmm-coverage=80, as suggested in ftp://orengoftp.biochem.ucl.ac.uk/gene3d/CURRENT_RELEASE/gene3d_hmmsearch/README_scan.txt ?
On Mon, Dec 9, 2019 at 4:38 PM Ian Sillitoe notifications@github.com wrote:
Relaying a question from an email - Harry wants to encourage longer domains in the final resolved domain boundaries (reason: the superfamily in question has discontinuous domains and we want to avoid the possibility of getting individual "segments", rather than the full domain).
I tried running cath-resolve-hits with --long-domains-preference set to 2 and 3.5 and it does seem to improve things, but I don't really know what sort of value it should be set to.
I couldn't find any documentation on recommended ranges to use with this param - any ideas?
Thanks Jon - not sure whether the models in this library are explicitly named as dc_*.
@harryscholes?
Hi Jon, I do use --min-dc-hmm-coverage=80 for resolving hmmsearch outputs of Gene3D S95 models using cath-resolve-hits. Should I also use this option for resolving FunFam hits? I am finding a possible problem, where I am getting a lot of very short FunFam matches that are much shorter than the FunFam HMM lengths.
Results with:
cath-resolve-hits --input-format hmmsearch_out --min-dc-hmm-coverage=80 --worst-permissible-bitscore 25
I would say that this has worked. Thanks @jonglees.
@sillitoe my code was inspired by https://github.com/UCLOrengoGroup/cath-tools-genomescan. Do you think the following lines should be updated to include these cath-resolve-hits filters?
Sorry for the slow response.
Are y'all happy with where this has ended up? Please shout if not.
From what I can see in the code, --long-domains-preference x causes the score of a hit to be multiplied by the length of the hit to the power of x. So: the default value of 0 means all hits are multiplied by 1 (i.e. remain equally preferred relative to each other); 1 would mean multiplying every hit's score by its length; -1 would mean dividing every hit's score by its length.
(Detail: actually each length is divided by 400 before raising it to the power so as to prevent the numbers from getting very silly very quickly but that doesn't affect how much one hit is preferred relative to another.)
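To make the arithmetic concrete, here is a minimal Python sketch of the weighting as described above. It is based only on that description, not on the cath-tools source; the function names, the example lengths and the example score are all hypothetical, and the "score" is simply whatever per-hit score cath-resolve-hits works with internally.

```python
# Sketch of the length weighting described above (an assumption drawn from
# the description in this thread, not a copy of the cath-tools code).

LENGTH_SCALE = 400.0  # each length is divided by 400 before raising to the power


def adjusted_score(score, length, preference, scale=LENGTH_SCALE):
    """Multiply a hit's score by (length / scale) ** preference."""
    return score * (length / scale) ** preference


def relative_preference(long_len, short_len, preference, scale=LENGTH_SCALE):
    """How much more a long hit is favoured over a short one with equal raw scores."""
    return adjusted_score(1.0, long_len, preference, scale) / adjusted_score(
        1.0, short_len, preference, scale
    )


if __name__ == "__main__":
    # Hypothetical case: a 60-residue segment vs a 240-residue full domain,
    # both with the same raw score.
    for preference in (0, 1, 2, 3.5):
        ratio = relative_preference(240, 60, preference)
        print(f"preference={preference}: full domain favoured {ratio:.1f}x")
```

The ratio between two hits is (long_len / short_len) ** preference, so the division by 400 cancels out, which is why it does not change how much one hit is preferred relative to another.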
But this is just a knob that's available for you to twiddle to give a bit of control; if the details of the numbers are important to you, you should probably be generating your own scores.
Thanks for the clear explanation Tony! I tried running crh with the following options:
cath-resolve-hits \
--input-format hmmsearch_out \
--min-dc-hmm-coverage 80 \
--worst-permissible-bitscore 25 \
--long-domains-preference {2, 3.5}
Resulting in the following distribution of match lengths, whether I used 2 or 3.5:
I'm happy with this, so I think the issue can be closed. Thanks all!
Really clear (and useful) results - thanks all