aqlaboratory / proteinnet

Standardized data set for machine learning of protein structure
MIT License
867 stars, 132 forks

No secondary structure data in CASP12 TFRecord files. I didn't check others... #5

Open ufimtsev opened 5 years ago

mircare commented 5 years ago

"* CASP12 test set is incomplete due to embargoed structures. Once the embargo is lifted we will release all structures." https://github.com/aqlaboratory/proteinnet

ufimtsev commented 5 years ago

This isn't limited to CASP12. None of the files contain secondary structure data.

basantab commented 5 years ago

Hi,

Yes, looks like the text-based records of CASP11 are also missing the secondary structure entries.

alquraishi commented 5 years ago

Thanks for bringing this to my attention. I will update the files soon with secondary structure information.

crvineeth97 commented 5 years ago

@alquraishi Thank you for the amazing resource. May I please know when this issue will be fixed? Thanks!

AlexeyG commented 5 years ago

I checked a number of splits for a number of CASPs, in both TFRecord and text formats. I wasn't exhaustive, but it seems like secondary structure data is missing from all of them.

Can the information still be (easily) added to the datasets?
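
For anyone who wants to repeat this check on the text-based records, here is a minimal sketch. It assumes each section of a record starts with a bracketed header on its own line (such as `[ID]` or `[PRIMARY]`) and simply counts which headers occur; the sample record and the `SECONDARY` header name are made up for illustration and are not taken from the actual files.

```python
import io
import re
from collections import Counter

# Assumption: section headers are uppercase names in square brackets on their own line.
HEADER = re.compile(r"^\[([A-Z_]+)\]$")

def count_sections(lines):
    """Count how often each bracketed section header appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = HEADER.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical, hand-written record standing in for a real text-format file.
sample = io.StringIO("[ID]\nexample_1\n[PRIMARY]\nMKV\n[MASK]\n+++\n\n")
counts = count_sections(sample)

print(dict(counts))
print("has secondary structure section:", "SECONDARY" in counts)
```

To run it on a real file, pass an open file handle instead of `sample`, for example `count_sections(open(path))`, where `path` points to one of the downloaded text records.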

alquraishi commented 5 years ago

Hi @AlexeyG, yes the information can be added easily. It's mostly already there, I just need to expose it. Stay tuned.

oskar-taubert commented 5 years ago

Hi @alquraishi, can you estimate when this is going to happen? I would like to use ProteinNet to compare a couple of predictor architectures and a standardized dataset would be useful. Thanks.

uoda commented 5 years ago

I am Harun Or Rashid. I am doing my master's thesis on protein sequence, structure, and function analysis at the University of Wuerzburg, Germany, under Prof. Dr. Thomas Dandekar, who is the chair of the Department of Bioinformatics.

I have been trying to use your RGN network to predict protein 3D structure from sequence. I followed the instructions on your GitHub: https://github.com/aqlaboratory/rgn

I installed the CPU version of TensorFlow 1.10.0, along with Python 2.7 and setproctitle, in a conda environment.

I created the directories as you described: RGN7/data/ProteinNet7Thinning90/ (containing the testing, training, and validation folders) and RGN7/runs/CASP7/ProteinNet7Thinning90/CASP7.config

I ran the script: python protling.py RGN7/runs/CASP7/ProteinNet7Thinning90/CASP7.config -d RGN7 -p

But the only output I got was one CASP7.log file, which I have attached here.

I do not understand the error or where the problem is. Would you please help me solve this issue and run this properly?

MoZZez commented 4 years ago

Hello, I also encountered this issue (specifically in CASP11). Are there still plans to add secondary structures in the foreseeable future?

deepgradient commented 4 years ago

> Hi @AlexeyG, yes the information can be added easily. It's mostly already there, I just need to expose it. Stay tuned.

Any progress in resolving this issue?

spetti commented 4 years ago

I am also very interested in using the secondary structure information. Are there still plans to release this info? Thanks!

alquraishi commented 4 years ago

As an interim solution I added JSON files for the secondary structure data. I say interim because there are a few caveats: the data is not currently integrated within the rest of ProteinNet. Instead, these JSON files are on their own and in an ad hoc file format. There are two files, one corresponding to single domain entries coming from ASTRAL and the other to whole proteins coming from the PDB. The IDs of these entries match those of the original ProteinNet files, and so it should be easy to cross-reference them. The only other wrinkle is that not all ProteinNet entries have secondary structure information, but the vast majority do. The files are linked to in the main README page.
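
Since the IDs are meant to line up, the cross-referencing could look roughly like the sketch below. This is only an illustration: the JSON layout (a single object mapping each ID to a secondary structure string), the example IDs, and the `cross_reference` helper are all assumptions, not the actual ad hoc format of the released files.

```python
import json

def cross_reference(protein_ids, secondary_by_id):
    """Split ProteinNet IDs into those with and without a secondary structure entry."""
    matched = {pid: secondary_by_id[pid] for pid in protein_ids if pid in secondary_by_id}
    missing = [pid for pid in protein_ids if pid not in secondary_by_id]
    return matched, missing

# Hypothetical JSON content; a real file would be loaded with json.load(open(path)).
secondary_by_id = json.loads('{"example_1": "HHHEEC", "example_2": "CCCHHH"}')

# Hypothetical IDs as they might be read from the ProteinNet records.
matched, missing = cross_reference(["example_1", "example_3"], secondary_by_id)

print("matched:", matched)
print("missing:", missing)
```

Keeping the unmatched IDs in a separate list makes it easy to see how many entries lack secondary structure information, which is expected for a minority of entries.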

deepgradient commented 4 years ago

@alquraishi Dear Mohammed,

I've just checked the IDs of the CASP-11 validation and test datasets against the JSON files you added. Unfortunately, I cannot match any IDs between the CASP-11 data and the JSON files. Is it possible that I am doing something wrong?

I would also be grateful if you could let me know whether the secondary structure information could be added directly to the original CASP datasets.

Thanks in advance.