Open ufimtsev opened 5 years ago
It doesn't apply only to CASP12. None of the files contain secondary structure data.
Hi,
Yes, looks like the text-based records of CASP11 are also missing the secondary structure entries.
Thanks for bringing this to my attention. I will update the files soon with secondary structure information.
@alquraishi Thank you for the amazing resource. May I please know when this issue will be fixed? Thanks!
I checked a number of splits for a number of CASPs, in both TFRecord and text formats. I wasn't exhaustive, but it seems like secondary structure data is missing from all of them.
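For anyone who wants to repeat this kind of check on the text-based records, here is a minimal sketch that counts how many records carry a `[SECONDARY]` section. It assumes each record starts with an `[ID]` line and that section headers sit alone on their own line in square brackets. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def count_sections(lines):
    """Count how often each bracketed section header appears.

    Assumption (not verified against every ProteinNet release): each record
    begins with an "[ID]" line, and headers such as "[PRIMARY]" sit alone
    on their own line.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            counts[line] += 1
    return counts


# Tiny made-up sample: two records, neither with a [SECONDARY] section.
sample = """[ID]
example_1
[PRIMARY]
MKV
[MASK]
+++

[ID]
example_2
[PRIMARY]
GA
[MASK]
+-
""".splitlines()

counts = count_sections(sample)
missing = counts["[ID]"] - counts["[SECONDARY]"]
print(f"{counts['[ID]']} records, {missing} without a [SECONDARY] section")
```

To run it on a real file, pass the open file object in place of `sample`, since the function only iterates over lines.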
Can the information still be (easily) added to the datasets?
Hi @AlexeyG, yes the information can be added easily. It's mostly already there, I just need to expose it. Stay tuned.
Hi @alquraishi, can you estimate when this is going to happen? I would like to use ProteinNet to compare a couple of predictor architectures and a standardized dataset would be useful. Thanks.
I am Harun Or Rashid, doing a master's thesis on protein sequence, structure, and function analysis at the University of Wuerzburg, Germany, under Prof. Dr. Thomas Dandekar, who is the chair of the Department of Bioinformatics.
I have been trying to implement your RGN network to predict protein 3D structure from sequence. I followed the instructions in your GitHub repository: https://github.com/aqlaboratory/rgn
I installed the CPU version of TensorFlow 1.10.0, along with Python 2.7 and setproctitle, in a conda environment.
I created the directories as you described: RGN7/data/ProteinNet7Thinning90/ with testing, training, and validation folders, and RGN7/runs/CASP7/ProteinNet7Thinning90/ containing CASP7.config.
I ran the script: python protling.py RGN7/runs/CASP7/ProteinNet7Thinning90/CASP7.config -d RGN7 -p
But the output was a single CASP7.log file, which I have attached here.
I do not understand the error or where the problem is. Would you please help me solve this issue and do this properly?
Hello, I also encountered this issue (specifically in CASP11). Are there still plans to add secondary structures in the foreseeable future?
Any progress in resolving this issue?
I am also very interested in using the secondary structure information. Are there still plans to release it? Thanks!
As an interim solution, I added JSON files for the secondary structure data. I say interim because there are a few caveats: the data is not currently integrated with the rest of ProteinNet. Instead, these JSON files stand on their own and use an ad hoc file format. There are two files, one corresponding to single-domain entries coming from ASTRAL and the other to whole proteins coming from the PDB. The IDs of these entries match those of the original ProteinNet files, so it should be easy to cross-reference them. The only other wrinkle is that not all ProteinNet entries have secondary structure information, but the vast majority do. The files are linked from the main README page.
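The cross-referencing described above could look roughly like the following sketch. Since the JSON format is ad hoc, the shape used here (a single object mapping each ID to a secondary structure string) is only an assumption, as is the text-record layout in which the ID sits on the line after an `[ID]` header. The function names and sample data are hypothetical.

```python
import json


def read_ids(lines):
    """Yield the ID that follows each "[ID]" header line (assumed layout)."""
    expect_id = False
    for line in lines:
        line = line.strip()
        if expect_id:
            yield line
            expect_id = False
        elif line == "[ID]":
            expect_id = True


def attach_secondary(record_ids, ss_by_id):
    """Split IDs into those with and without a secondary structure entry."""
    found = {i: ss_by_id[i] for i in record_ids if i in ss_by_id}
    missing = [i for i in record_ids if i not in ss_by_id]
    return found, missing


# Made-up stand-ins for a ProteinNet text file and one of the JSON files.
records = "[ID]\nexample_1\n[PRIMARY]\nMKV\n\n[ID]\nexample_2\n[PRIMARY]\nGA\n".splitlines()
ss_by_id = json.loads('{"example_1": "HHC"}')  # assumed shape: {id: string}

found, missing = attach_secondary(list(read_ids(records)), ss_by_id)
print(found)    # entries that have secondary structure
print(missing)  # entries without it
```

Keeping the `missing` list separate makes it easy to see which entries fall into the minority without secondary structure information, rather than silently dropping them.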
@alquraishi Dear Mohammed,
I've just checked the IDs of the CASP11 validation and test datasets against the JSON files you added. Unfortunately, I cannot match any IDs between the CASP11 datasets and the JSON files. Is it possible that I am doing something wrong?
I would be grateful if you could let me know whether the secondary structure information could be added directly to the original CASP datasets.
Thanks in advance.
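One way to narrow down a mismatch like this is to measure the overlap between the two ID sets under a few candidate normalizations. The sketch below is only a diagnostic aid. The normalizations (case folding, stripping a `#`-delimited prefix) are guesses to test, not documented rules, and all IDs shown are made up.

```python
def overlap_report(dataset_ids, json_ids, normalizers):
    """Return {name: number of dataset IDs found in json_ids} per normalizer."""
    report = {}
    for name, fn in normalizers.items():
        normalized_json = {fn(i) for i in json_ids}
        report[name] = sum(fn(i) in normalized_json for i in dataset_ids)
    return report


# Candidate normalizations; each is a hypothesis to test, not a known rule.
normalizers = {
    "exact": lambda s: s,
    "casefold": lambda s: s.casefold(),
    "strip_prefix": lambda s: s.split("#", 1)[-1],  # hypothetical "NN#" prefix
}

# Made-up IDs for illustration only.
dataset_ids = ["70#1ABC_1_A", "T0001"]
json_ids = ["1ABC_1_A", "2XYZ_1_B"]

report = overlap_report(dataset_ids, json_ids, normalizers)
print(report)
```

If one normalization raises the match count well above the exact comparison, the mismatch is probably a formatting difference and not missing data.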
"* CASP12 test set is incomplete due to embargoed structures. Once the embargo is lifted we will release all structures." https://github.com/aqlaboratory/proteinnet