Results on CASP14 targets

RosettaCommons / RoseTTAFold

This package contains deep learning models and related scripts for RoseTTAFold

Results on CASP14 targets #18

Open jiaxiang-wu opened 3 years ago

jiaxiang-wu commented 3 years ago

I am running some experiments with RoseTTAFold on CASP14 targets, but I am uncertain about a few detailed settings:

1. Input FASTA sequences. Do you use per-chain FASTA sequences as provided on the CASP14 website (w/o official domain definitions), or do you use per-domain FASTA sequences (cropped from per-chain sequences using official domain definitions)? If the latter, how do you deal with discontinuous domains, e.g. T1027-D1?
2. Sequence/template databases. For experiments on CASP14 targets, is it correct to use the following databases (to prevent data leakage)?
   - UniRef30 @ 2020-03: http://wwwuser.gwdg.de/~compbiol/uniclust/2020_03/
   - BFD: https://bfd.mmseqs.com/
   - PDB100 @ 2020-03: https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2020Mar11.tar.gz
3. Do we need to modify run_e2e_ver.sh and/or run_pyrosetta_ver.sh (or any other scripts) in order to reproduce the results reported in the paper?

Could you clarify the above questions? Many thanks in advance.
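For reference, the per-domain cropping described in the first question could be sketched as below. This is only an illustration of one naive option (concatenating the segments of a discontinuous domain); it is not confirmed to be what anyone in this thread does. The function name `crop_domain`, the example sequence, and the residue ranges are all hypothetical and are not taken from the CASP14 domain definitions.

```python
from typing import List, Tuple


def crop_domain(chain_seq: str, segments: List[Tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive residue ranges taken from a chain sequence.

    A discontinuous domain is simply passed as more than one segment.
    """
    pieces = []
    for start, end in segments:
        if not (1 <= start <= end <= len(chain_seq)):
            raise ValueError(f"segment {start}-{end} is outside 1-{len(chain_seq)}")
        # Convert the 1-based inclusive range to a Python slice.
        pieces.append(chain_seq[start - 1:end])
    return "".join(pieces)


# Hypothetical chain sequence and domain boundaries, for illustration only.
chain = "MKTAYIAKQRQISFVKSHFSRQ"
domain = crop_domain(chain, [(1, 5), (11, 15)])
print(domain)  # MKTAYQISFV
```

Note that concatenation creates an artificial junction between residues that are not adjacent in the real chain, which is one reason the handling of discontinuous domains needs to be stated explicitly.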

minkbaek commented 3 years ago

I just added a link to the CASP14 results in the README.md file (https://files.ipd.uw.edu/pub/RoseTTAFold/casp14_models.tar.gz). It includes the input MSA and template (hhr and atab) files as well as five RoseTTAFold models for each target.

jiaxiang-wu commented 3 years ago

Thanks for your feedback. So you are using per-chain FASTA sequences as inputs (since the provided results are for per-chain, not per-domain, inputs), which resolves my first question.

Are the sequence/template databases (UniRef30 / BFD / PDB100) mentioned above correct?
If I run run_pyrosetta_ver.sh as is, without any modification to the repo, should I expect to get the same or similar results as those provided? Or should I modify some hyper-parameter settings in the repo?

minkbaek commented 3 years ago

Yes, I'm using per-chain FASTA sequences as inputs. (T1031, T1033, T1035, T1037, T1039, and T1040-T1043 are domains from a single protein, T1044. I modeled them together as a single protein (T1044) and split the model into the corresponding targets for evaluation.) I used UniRef30 2020-01, BFD, and pdb100_2020Mar11.

The input MSA generation for CASP14 was slightly different from what we're using now for the Robetta web server (the version posted in this repo). Basically, it used more fine-grained e-values (from e-80 to e-1) for the sequence search. Also, for the CASP14 benchmarks we generated 45 models with the pyrosetta script rather than 15. When we tested the current changes (simplifying the sequence search, reducing the number of structures to sample) on the test cases, there were no big differences and we were able to get similar results. But I recommend using the provided MSAs and templates if you want to reproduce the results or make a fair comparison to other neural networks.
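The split-for-evaluation step described above (model T1044 as one chain, then cut the model into per-target pieces) could be sketched roughly as follows. This is a minimal illustration, not the script used for the paper: `extract_residues` and `make_ca_line` are hypothetical helpers, and the target names and residue ranges are placeholders rather than the real CASP14 boundaries. It assumes standard fixed-column PDB ATOM records, where the residue number occupies columns 23-26.

```python
from typing import Dict, Iterable, List, Tuple


def extract_residues(pdb_lines: Iterable[str],
                     ranges: List[Tuple[int, int]]) -> List[str]:
    """Keep ATOM/HETATM records whose residue number falls in any inclusive range."""
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        resnum = int(line[22:26])  # PDB columns 23-26 hold the residue number
        if any(lo <= resnum <= hi for lo, hi in ranges):
            kept.append(line)
    return kept


def make_ca_line(serial: int, resnum: int) -> str:
    """Hypothetical helper: build a dummy CA ATOM record for the demo below."""
    return (f"ATOM  {serial:5d}  CA  ALA A{resnum:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{float(resnum):8.3f}")


# Placeholder full-chain "model" with ten residues.
full_model = [make_ca_line(i, i) for i in range(1, 11)]

# Placeholder target definitions; real ranges come from the CASP14 target list.
targets: Dict[str, List[Tuple[int, int]]] = {
    "target_A": [(1, 4)],
    "target_B": [(5, 10)],
}

for name, ranges in targets.items():
    piece = extract_residues(full_model, ranges)
    print(name, len(piece), "atoms")
```

In practice each piece would be written to its own file and scored against the corresponding target's reference structure.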

jiaxiang-wu commented 3 years ago

Got it. Thanks for your detailed explanation. Closing the issue.

jiaxiang-wu commented 3 years ago

Hi Baek, for the "casp14_models.tar.gz" file linked above, I am wondering whether the five models provided for each target are ranked in order (by Rosetta energy?), i.e., whether the first model is the "best" model in terms of Rosetta energy.