UCLOrengoGroup / cath-tools-seqscan

CATH: scan/align protein sequences against functional families

Control number of FF #3

Closed dudimarcus closed 8 years ago

dudimarcus commented 8 years ago

Currently the top 50 FunFams are retrieved. Is there any chance to control or increase this number?

sillitoe commented 8 years ago

456ad68 should address this

dudimarcus commented 8 years ago

Great! Thanks Ian.

It seems there is still some threshold limiting the number of FFs retrieved: even with a high number of hits there are few FFs, and sometimes only one or two domains with many FFs. Could it be the significance score?

sillitoe commented 8 years ago

Could you add a test case?

e.g. a file containing the query sequence, the command-line usage, what you expected, and what you got

On 6 Sep 2016 5:28 p.m., "David Marcus" notifications@github.com wrote:

Great! Thanks Ian.

It seems there is still some limiting threshold that still controls the number of FFs retrieved since even with high numbers of hits there are low number of FF and sometimes only one or two domains with many FFs, could it be the significance score?

dudimarcus commented 8 years ago

I guess the question is whether something other than the hit limit, such as the e-value, limits the number of FFs. For example, UniProt ID Q8E9X9 finds 10 FFs with 1 domain:

HIT 3.40.50.1580/FF/5979 4.6e-143 Uridine phosphorylase
HIT 3.40.50.1580/FF/2436 1.4e-44 Uridine phosphorylase
HIT 3.40.50.1580/FF/4594 1.9e-34 Phosphorylase superfamily protein
HIT 3.40.50.1580/FF/6092 2.1e-30 Purine nucleoside phosphorylase DeoD-type
HIT 3.40.50.1580/FF/4309 4.3e-30 Purine or other phosphorylase family 1
HIT 3.40.50.1580/FF/3469 1.1e-23 Uridine phosphorylase
HIT 3.40.50.1580/FF/6091 1.3e-15 Uridine phosphorylase 1, isoform CRA_a
HIT 3.40.50.1580/FF/2937 4.7e-10 Uridine phosphorylase
HIT 3.40.50.1580/FF/6120 3.1e-07 MTA/SAH nucleosidase
HIT 3.40.50.1580/FF/5988 1.9e-04 Putative AMP nucleosidase
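For anyone who wants to check how many FunFams come back per superfamily, here is a rough Python sketch that tallies the hit lines quoted above. It assumes each hit is a whitespace-separated record of the form `HIT <superfamily>/FF/<number> <e-value> <description>`, which is inferred only from the pasted output and is not a documented format. The names `parse_hits` and `funfams_per_superfamily` are hypothetical helpers, not part of cath-tools-seqscan.

```python
from collections import defaultdict

# Three of the hit lines quoted above, one record per line (assumed layout).
SAMPLE = """\
HIT 3.40.50.1580/FF/5979 4.6e-143 Uridine phosphorylase
HIT 3.40.50.1580/FF/6120 3.1e-07 MTA/SAH nucleosidase
HIT 3.40.50.1580/FF/5988 1.9e-04 Putative AMP nucleosidase
"""


def parse_hits(text):
    """Return (funfam_id, evalue, description) tuples for lines starting with HIT."""
    hits = []
    for line in text.splitlines():
        # Split into at most 4 fields so the description keeps its spaces.
        parts = line.split(None, 3)
        if len(parts) < 3 or parts[0] != "HIT":
            continue
        description = parts[3] if len(parts) > 3 else ""
        hits.append((parts[1], float(parts[2]), description))
    return hits


def funfams_per_superfamily(hits, max_evalue=None):
    """Group FunFam ids by superfamily, optionally dropping hits above max_evalue."""
    grouped = defaultdict(list)
    for funfam_id, evalue, _description in hits:
        if max_evalue is not None and evalue > max_evalue:
            continue
        # Everything before "/FF/" is taken to be the superfamily id.
        superfamily = funfam_id.split("/FF/")[0]
        grouped[superfamily].append(funfam_id)
    return dict(grouped)


if __name__ == "__main__":
    hits = parse_hits(SAMPLE)
    for superfamily, funfams in funfams_per_superfamily(hits).items():
        print(superfamily, len(funfams))
```

Passing a `max_evalue` (an illustrative parameter of this sketch, not a tool option) shows how many of the reported FunFams would survive a stricter cutoff.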

sillitoe commented 8 years ago

No, not aware of any other limit - that's just how many FunFams match. You wouldn't really want to go with much higher e-values than that anyway.

Bear in mind the FunFam HMMs are deliberately designed to be specific. When we are interested in increasing coverage, then we use HMMs built from jackhmmer to catch more general matches. The inferences between the two types of matches are different though.

sillitoe commented 8 years ago

I might not fully understand your question.

What makes you think there might be a limit? Were you expecting more than those 10 FunFam hits?

dudimarcus commented 8 years ago

No, so far everything works as expected. I just wanted to make sure there is no other limit on the number of hits.

sillitoe commented 8 years ago

Okay, cool. I'll close the ticket.

By the way, did Roman manage to sort out the Perl issues he was having? If not, it would be great if you could help him to get stuff working on whatever setup he was using (or let me know more details).