leuaqut commented 2 years ago

Hello FAS creators

I am interested in running the FAS program using InterProScan results. However, I am a bit stumped on making the featuretype file. Currently, InterProScan outputs these annotations: CDD, Coils, Gene3D, MobiDBLite, PANTHER, Pfam, PRINTS, ProSitePatterns, ProSiteProfiles, SMART, SUPERFAMILY, TIGRFAM

Do you have any idea how to categorise these? Also, do you have any preference for the weighting parameters? If possible, could you clarify what the --bidirectional setting does?

Thanks!

JuRuDo commented 2 years ago

Hi, choosing which databases should be linearized is a bit hard. I would recommend not linearizing PANTHER and TIGRFAM, as these only predict the family of the whole protein. For the rest it is harder to decide. Generally, I believe it is better not to have too many databases in the linearized section, as this raises the complexity of the calculations. As Pfam, SMART, CDD and SUPERFAMILY have some overlaps in their annotations, a possible featuretype file could look like this:

linearized

Pfam SMART SUPERFAMILY CDD

normal

PANTHER TIGRFAM ProSiteProfiles ProSitePatterns PRINTS MobiDBLite Gene3D Coils

checked
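For anyone who wants to generate such a file from a script, here is a minimal Python sketch of the categorisation above. The helper name `render_featuretypes` is hypothetical, and two things about the file syntax are assumptions that this thread does not confirm: that each section header is written with a leading `#` (the page shows the headers as bare headings), and that each database goes on its own line. Please compare the output with a featuretype file from your own FAS installation before relying on it.

```python
# Hypothetical helper, not part of FAS. The "#section" header syntax and the
# one-name-per-line layout are assumptions; verify them against a real file.
def render_featuretypes(linearized, normal, checked=()):
    """Return the text of a featuretype file for the given database names."""
    overlap = set(linearized) & set(normal)
    if overlap:
        raise ValueError(f"databases listed in two sections: {sorted(overlap)}")
    lines = []
    sections = (("linearized", linearized), ("normal", normal), ("checked", checked))
    for section, tools in sections:
        lines.append(f"#{section}")
        lines.extend(tools)
    return "\n".join(lines) + "\n"


text = render_featuretypes(
    linearized=["Pfam", "SMART", "SUPERFAMILY", "CDD"],
    normal=["PANTHER", "TIGRFAM", "ProSiteProfiles", "ProSitePatterns",
            "PRINTS", "MobiDBLite", "Gene3D", "Coils"],
)
print(text)
```

The overlap check is only a convenience: it stops a database from accidentally being listed as both linearized and normal.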

Unless you are interested in specific domains or databases, I recommend going with the default statistical weighting, where you provide a reference proteome (usually that of the query protein). Each feature's weight is then based on how often it occurs in the reference, so that abundant domains get lower weights and rare domains get higher weights.
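To illustrate the idea that abundant features are weighted down and rare ones up, here is a small Python sketch using plain inverse frequency. This is not FAS's actual weighting function, which this thread does not spell out; the function name and the feature names are made up for the example.

```python
from collections import Counter


# Illustrative only: plain inverse-frequency weights, normalised to sum to 1.
# FAS's real weighting function may differ from this.
def rarity_weights(reference_annotations):
    """Map each feature to a weight that shrinks as its count in the reference grows."""
    counts = Counter(reference_annotations)
    inverse = {feature: 1.0 / n for feature, n in counts.items()}
    total = sum(inverse.values())
    return {feature: w / total for feature, w in inverse.items()}


# Hypothetical reference proteome: one entry per annotated feature instance.
reference = ["Pfam_kinase"] * 8 + ["Pfam_SH2"] * 2
weights = rarity_weights(reference)
print(weights)
```

Here the feature that occurs eight times in the reference ends up with a smaller weight than the one that occurs twice.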

The FAS score is a directional score. This means that the default, so-called 'forward' score punishes domains that are in the seed protein but not in the query. It does not punish domains that are in the query but not in the seed. With the --bidirectional option active, it also calculates what we call the 'backward' score, where this scheme is flipped (domains that are in the query but not in the seed are punished). So if the forward score is low but the backward score is high, the seed protein contains domains that are not in the query. If the forward score is high and the backward score is low, then the query protein has domains that the seed lacks. If both are low, then both proteins have domains that the other does not.
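The directionality can be seen in a toy Python example. The score below is a deliberate simplification (the fraction of one protein's features that also occur in the other); it ignores weighting, positions and linearization, so it is not how FAS computes its score. All names are hypothetical.

```python
# Toy stand-in for a directional score, NOT the FAS algorithm:
# the fraction of the reference protein's features found in the other protein.
def toy_directional_score(reference_features, other_features):
    reference_features = set(reference_features)
    if not reference_features:
        return 1.0  # nothing to lose, so nothing is punished
    shared = reference_features & set(other_features)
    return len(shared) / len(reference_features)


seed = {"A", "B", "C", "D"}   # hypothetical domain names
query = {"A", "B"}

forward = toy_directional_score(seed, query)    # punishes seed-only domains
backward = toy_directional_score(query, seed)   # punishes query-only domains
print(forward, backward)
```

In this example the forward score is low and the backward score is high, which matches the case above where the seed contains domains (C and D) that the query lacks.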

I hope this was helpful

leuaqut commented 2 years ago

Thanks a lot Julian! That was very helpful.

JuRuDo commented 1 year ago

closed

BIONF / FAS

Creating a featuretype file #25
