Rostlab / MetaDisorder

Protein sequence-based Disorder Predictor

Research dataset sizes #2

Closed juanmirocks closed 8 years ago

juanmirocks commented 8 years ago

Any news on this?

fsblu commented 8 years ago

Hello,

Yes. Our method used DisProt 3.4 with 460 proteins. Of these, 77 proteins were eliminated from the beginning. There are many newer versions of DisProt; the newest one is:

DisProt Release 6.02, 2013-05-24
Number of proteins: 694
Number of disordered regions: 1539

I am having a tough time understanding this paragraph:

"The entire data set included 298 sequence-unique proteins with 27,117 disordered (positives) and 61,118 well-structured (negatives) residues. Our results were qualitatively similar for sequence-unique filtering at HSSP-values<0 (i.e., 21% pairwise sequence identity for >250 aligned residues); however, for that number only 135 proteins remained in the DisProt data set."

Could you please explain what those 135 proteins correspond to? Are they the result of the alignment?
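As a quick sanity check on the quoted numbers, the class balance of that data set can be worked out directly. This is only a small sketch using the two residue counts from the quoted paragraph; the variable names are my own.

```python
# Residue counts taken from the quoted paragraph (298 sequence-unique proteins).
positives = 27_117   # disordered residues
negatives = 61_118   # well-structured residues

total = positives + negatives
disordered_fraction = positives / total

print(f"{total} residues in total, {disordered_fraction:.1%} disordered")
```

So roughly three in ten residues are labelled disordered, which is worth keeping in mind when comparing per-residue scores across data sets.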


juanmirocks commented 8 years ago

The way I understand it, when:

Then you can read in Figure S2f:

Per-protein performance on long disordered regions. Data set: 86 DisProt proteins with at least one long (>30 residues) disordered region. This set was compiled using more stringent cutoff for homology (HSSP-values<0). Our final method MD identified more true positives than the other methods at most of the false positive rates. Note that this set is much smaller than the one compiled using HSSP-values<10 that the error margins are significantly higher.

That to me means that in the end, they really used only the subset of 135 proteins for training. The initial set of 460 was not fully used in the method because it contained high-similarity sequences, according to their HSSP-based redundancy filters.
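To make the idea of an HSSP-based redundancy filter concrete, here is a minimal sketch. It assumes the piecewise HSSP curve as I recall it from Rost (1999), and a simple greedy filter over precomputed pairwise alignments. The function names, the `pairs` input format and the greedy order are my own assumptions for illustration, not the procedure the MD authors actually ran.

```python
import math

def hssp_curve(length: int) -> float:
    """Identity threshold (%) of the HSSP curve at a given alignment length.

    Assumed form (Rost 1999): 100 for short alignments, a power law up to
    450 aligned residues, and a flat 19.5% beyond that.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5

def hssp_value(identity_pct: float, length: int) -> float:
    """Distance (in percentage points) of an alignment from the HSSP curve."""
    return identity_pct - hssp_curve(length)

def greedy_filter(ids, pairs, cutoff=0.0):
    """Keep a protein only if no already-kept protein is too similar to it.

    pairs maps frozenset({a, b}) -> (identity_pct, aligned_length); missing
    pairs are treated as 'no significant alignment' (hypothetical format).
    """
    kept = []
    for pid in ids:
        redundant = False
        for other in kept:
            hit = pairs.get(frozenset((pid, other)))
            if hit is not None and hssp_value(*hit) >= cutoff:
                redundant = True
                break
        if not redundant:
            kept.append(pid)
    return kept

# Toy example with made-up alignments: A and B are clearly similar.
pairs = {
    frozenset(("A", "B")): (40.0, 300),
    frozenset(("B", "C")): (15.0, 300),
}
unique = greedy_filter(["A", "B", "C"], pairs, cutoff=0.0)
```

At 250 aligned residues the curve sits at about 21% identity, which matches the "HSSP-values<0" remark in the quoted paragraph. A lower cutoff removes more proteins, which is consistent with the set shrinking from 298 (HSSP-values<10) to 135 (HSSP-values<0).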

juanmirocks commented 8 years ago

Therefore, the only remaining question is the second checkbox, namely, whether there are other newer/bigger datasets.

fsblu commented 8 years ago

I have included the answer in the presentation, but I have also done a search.

The other data set options:

D2P2: http://d2p2.pro/about/database (this is larger than DisProt)

MOBIDB: http://mobidb.bio.unipd.it/ (this one takes information from DisProt; I am a bit confused about it, since it looks more like a prediction method than a dataset, and I did not understand why it is referred to as a dataset)

PRODDO: http://bioinformatics.oxfordjournals.org/content/17/4/379.full.pdf+html (and then there is this one)

Those are larger than DisProt, but the same protein might be listed several times; I do not know whether this is also the case in DisProt. DisProt seems to be used more than any of those datasets.


fsblu commented 8 years ago

And thank you for the explanation :)


juanmirocks commented 8 years ago

Good. Just get the sizes of those DBs and you are good to go.

It is enough to read off the numbers they claim, even if they may contain repetitions.

sirmaG commented 8 years ago

I found the following information about the sizes of the above-mentioned databases:

juanmirocks commented 8 years ago

Thanks :+1: