ccdmb / predector

Effector prediction pipeline based on protein properties.
Apache License 2.0

about the TM domain prediction #46

Open xizhesun opened 3 years ago

xizhesun commented 3 years ago

Dear Darcy,

When I ran the pipeline, I found that the complete protein sequences are used as the input to TMHMM. I think the mature protein sequences predicted by SignalP would be a better input than the complete sequences, because a TM domain predicted within the signal peptide would have no function. What do you think about this?

Thanks, Xizhe

xizhesun commented 3 years ago

For example, this paper uses mature protein sequences as the input to TMHMM:

https://www.frontiersin.org/articles/10.3389/fpls.2014.00098/full

(3) no transmembrane domain was predicted to occur after the cleavage site using Tmhmm v2.0c;

Cheers, Xizhe

darcyabjones commented 3 years ago

Hi Xizhe,

Usually the approach we take is to just ignore any TM domains predicted within the SP region. A lot of the point of the ranking part of the pipeline was to discourage the use of a series of hard filters, because the error accumulates as you add more prediction methods. If we only input mature sequences to TMHMM, we can't look for effector-like proteins that lack signal peptides or that are incorrectly predicted to lack one.

The output includes the positions of the predicted TM domains, and also the estimated number of TM residues within the first 60 AAs (which the LTR model uses to decide whether it should worry about any TM domains).
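
As a rough illustration of the "ignore TM domains within the SP region" convention, here is a minimal Python sketch. It is not Predector's actual code: the function name `tm_domains_outside_sp`, the `(start, end)` tuple layout, and the `sp_end` parameter are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: 1-based, inclusive (start, end) residue positions.
Domain = Tuple[int, int]


def tm_domains_outside_sp(
    tm_domains: List[Domain],
    sp_end: Optional[int],
) -> List[Domain]:
    """Keep only TM domains that start after the predicted signal peptide.

    sp_end is the last residue of the predicted signal peptide (1-based),
    or None when no signal peptide was predicted. A domain that overlaps
    the SP region at all is discounted.
    """
    if sp_end is None:
        # No SP predicted: every TM domain is kept.
        return list(tm_domains)
    return [(start, end) for start, end in tm_domains if start > sp_end]
```

For instance, with TM domains at `(5, 22)` and `(100, 122)` and a predicted SP ending at residue 20, only `(100, 122)` would be kept. Proteins without a predicted SP are still scanned in full, which is the property the full-length input preserves.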

I'm not personally in favour of restricting it.

xizhesun commented 3 years ago

I know what you mean and I agree with your point. There may be a better choice: we could combine the mature protein sequences (for proteins with an SP) and the complete protein sequences (for proteins without an SP) into a single input file for TMHMM. That should be accurate and would not lose any candidates!
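
The combined input could be sketched roughly as below. This is only an illustration under stated assumptions: `build_tmhmm_input` and `to_fasta` are hypothetical helper names, sequences are held in a plain dict, and `sp_cut_sites` maps each ID to the number of signal peptide residues to remove (or `None` when no SP was predicted).

```python
from typing import Dict, Optional


def build_tmhmm_input(
    sequences: Dict[str, str],
    sp_cut_sites: Dict[str, Optional[int]],
) -> Dict[str, str]:
    """Use the mature sequence where an SP was predicted, else the full one."""
    out = {}
    for seq_id, seq in sequences.items():
        cut = sp_cut_sites.get(seq_id)
        # cut is the SP length; None or 0 means "no SP predicted".
        out[seq_id] = seq[cut:] if cut else seq
    return out


def to_fasta(sequences: Dict[str, str]) -> str:
    """Serialise the sequences as unwrapped FASTA text."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in sequences.items())
```

One consequence of this design is that TM positions reported for trimmed sequences are relative to the mature protein, so they would need to be shifted by the cut site to compare with full-length coordinates.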

darcyabjones commented 3 years ago

We've had a bit of an internal discussion about this one. The consensus was that people tend to look at where the TM domain prediction is, and if a predicted SP overlaps it they discount that TM domain.

I can imagine some edge cases where your suggestion might provide some benefit, but it also introduces a couple of technical issues (e.g. how should we find a consensus SP cut site from multiple programs? When should we take the mature sequence instead of the immature one?).

I think the best way to settle this is to benchmark it and see what happens. Part of the point of this project was to find out the best way to combine these tasks, so I'll be interested to see how it goes. It'll probably have to wait until we get around to updating the ranking function.

I'll leave this open as a reminder until then and hopefully we'll know in the next major release.

Thanks for the suggestion!