Sherman-1 commented 6 months ago

Hi, would it be possible to add an option to fill "empty" columns in the final dataframe output? I'm thinking of cases such as:

RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA

1 1 A L 0 0 119 0, 0.0 2,-0.3 0, 0.0 33,-0.2 0.000 360.0 360.0 360.0 168.8 8.7 6.9 63.0
2 2 A T E -a 34 0A 66 31,-2.0 33,-2.1 1,-0.1 2,-0.7 -0.456 360.0-169.6 -87.8 130.5 7.7 8.8 59.8
3 3 A K E > -a 35 0A 66 -2,-0.3 3,-1.2 31,-0.2 4,-0.2 -0.850 8.5-179.0-111.3 94.7 7.6 7.5 56.2
4 4 A L G >> S+ 0 0 23 31,-2.5 4,-2.9 -2,-0.7 3,-2.0 0.786 71.6 72.4 -65.6 -32.5 7.1 10.6 54.1
5 5 A Y G 34 S+ 0 0 2 30,-0.8 -1,-0.3 1,-0.3 31,-0.1 0.709 101.0 46.1 -56.9 -26.7 7.0 8.7 50.7
6 6 A Y G <4 S+ 0 0 39 -3,-1.2 -1,-0.3 2,-0.1 -2,-0.2 0.439 115.4 47.1 -93.5 -4.1 3.5 7.4 51.7
7 7 A E T <4 S- 0 0 138 -3,-2.0 2,-0.3 1,-0.2 -2,-0.2 0.825 135.2 -0.3 -99.6 -48.8 2.4 10.9 52.8
8 8 A D >< - 0 0 57 -4,-2.9 3,-1.4 3,-0.1 -1,-0.2 -0.852 61.5-167.8-144.3 106.0 3.6 13.0 49.9

Taken from the official DSSP website. As I understand it, the 4th and 5th columns of the first line are empty because there is nothing to put there. This makes the file difficult to load into a dataframe. Would it be possible to write something like "NA" or None where there is no value to output?

Thank you very much, Simon

drlemmus commented 6 months ago

Are you using the DSSP format or the mmCIF format? The latter has no empty columns and is whitespace delimited. The former is column formatted (fixed width), which was quite normal when the DSSP format was designed. You can read it with a formatted read statement. In some languages this is more straightforward than in others; you didn't say how you read the data.
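In Python, a formatted read can be approximated with plain string slicing, and blank fields can be turned into None at the same time. The following is only a minimal sketch: the column offsets and the "  #  RESIDUE" header prefix are assumptions based on the classic DSSP layout (they are not given in this thread), so check them against your own files. The names DSSP_SPANS, parse_dssp_line and read_dssp_residues are hypothetical.

```python
# Assumed column spans (0-based, end-exclusive) for the classic DSSP format.
# These offsets are NOT taken from this thread; verify them on a real file.
DSSP_SPANS = {
    "seq_num": (0, 5),    # sequential residue index
    "res_num": (5, 10),   # residue number as in the PDB file
    "chain": (11, 12),
    "aa": (13, 14),       # one-letter amino acid code
    "ss": (16, 17),       # secondary structure code, blank if none
    "acc": (34, 38),      # solvent accessibility
}


def parse_dssp_line(line, spans=DSSP_SPANS):
    """Slice one fixed-width residue line; blank fields become None."""
    record = {}
    for name, (start, end) in spans.items():
        value = line[start:end].strip()
        record[name] = value or None
    return record


def read_dssp_residues(lines):
    """Yield one dict per non-empty line after the residue table header."""
    in_table = False
    for line in lines:
        if line.startswith("  #  RESIDUE"):  # assumed header prefix
            in_table = True
            continue
        if in_table and line.strip():
            yield parse_dssp_line(line)
```

A list of such dicts can be handed to a dataframe constructor, where None is treated as a missing value. All values are left as strings here; converting them to numbers is left to the caller.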

Sherman-1 commented 6 months ago

Hi,

Thanks for your quick reply. I'm trying to use the DSSP format, but as I understand it the mmCIF format might be a better fit for what I'm trying to do? For now I'm trying to extract the sequence coordinates (begin/end) of the secondary structures from the output, using either Python (Polars/Pandas) or Awk if some text formatting is needed.
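For reference, once the per-residue codes are available, the begin/end extraction can be sketched by grouping consecutive residues that share a chain and a secondary-structure code. This is a rough sketch only: the input tuple layout and the function name ss_segments are hypothetical, a blank code is assumed to have been replaced by "-", and gaps in residue numbering are not detected.

```python
from itertools import groupby


def ss_segments(residues):
    """Collapse per-residue codes into (chain, ss, begin, end) segments.

    residues: iterable of (chain, res_num, ss_code) tuples in file order.
    Consecutive residues with the same chain and code form one segment.
    """
    segments = []
    for (chain, ss), group in groupby(residues, key=lambda r: (r[0], r[2])):
        members = list(group)
        segments.append((chain, ss, members[0][1], members[-1][1]))
    return segments


# Codes of the eight residues in the DSSP excerpt above, with blank as "-".
example = [
    ("A", 1, "-"), ("A", 2, "E"), ("A", 3, "E"), ("A", 4, "G"),
    ("A", 5, "G"), ("A", 6, "G"), ("A", 7, "T"), ("A", 8, "-"),
]
for segment in ss_segments(example):
    print(segment)
```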

Correct me if I'm wrong, but would the best way to extract this information from the mmCIF format be from these columns?

loop_
_struct_conf.id
_struct_conf.conf_type_id
_struct_conf.beg_label_comp_id
_struct_conf.beg_label_asym_id
_struct_conf.beg_label_seq_id
_struct_conf.pdbx_beg_PDB_ins_code
_struct_conf.end_label_comp_id
_struct_conf.end_label_asym_id
_struct_conf.end_label_seq_id
_struct_conf.pdbx_end_PDB_ins_code
_struct_conf.beg_auth_comp_id
_struct_conf.beg_auth_asym_id
_struct_conf.beg_auth_seq_id
_struct_conf.end_auth_comp_id
_struct_conf.end_auth_asym_id
_struct_conf.end_auth_seq_id
HELX_RH_AL_P1 HELX_RH_AL_P THR A 2 ? ASN A 18 ? THR A 14 ASN A 30
TURN_TY1_P1 TURN_TY1_P GLU A 19 ? GLU A 19 ? GLU A 31 GLU A 31
HELX_RH_AL_P2 HELX_RH_AL_P PHE A 21 ? ALA A 27 ? PHE A 33 ALA A 39
STRN1 STRN LEU A 28 ? ASN A 36 ? LEU A 40 ASN A 48
TURN_TY1_P2 TURN_TY1_P VAL A 37 ? PHE A 39 ? VAL A 49 PHE A 51
[ ... ]

I'm trying to re-annotate PDB files from the following database: https://opm.phar.umich.edu/ as they lack secondary-structure (SS) information. (One example PDB file: https://opm-assets.storage.googleapis.com/pdb/2bng.pdb)

drlemmus commented 6 months ago

Yes, these records are fine. Note that the files from OPM are not properly PDB formatted, which may be an issue if you want to run DSSP on them. If I understand correctly, OPM is based on PDB entries. In that case you can use the DSSP databank to get the annotations: https://pdb-redo.eu/dssp/download. That should save you a bit of time.
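To illustrate how the begin/end positions could be pulled from these records, here is a deliberately naive sketch in plain Python. It assumes the loop is laid out one tag per line and one record per line, with no quoted or multi-line values, which real mmCIF files do not guarantee; a proper mmCIF parser is the safer choice. The function names parse_struct_conf and segments are hypothetical.

```python
SAMPLE = """\
loop_
_struct_conf.id
_struct_conf.conf_type_id
_struct_conf.beg_label_comp_id
_struct_conf.beg_label_asym_id
_struct_conf.beg_label_seq_id
_struct_conf.pdbx_beg_PDB_ins_code
_struct_conf.end_label_comp_id
_struct_conf.end_label_asym_id
_struct_conf.end_label_seq_id
_struct_conf.pdbx_end_PDB_ins_code
_struct_conf.beg_auth_comp_id
_struct_conf.beg_auth_asym_id
_struct_conf.beg_auth_seq_id
_struct_conf.end_auth_comp_id
_struct_conf.end_auth_asym_id
_struct_conf.end_auth_seq_id
HELX_RH_AL_P1 HELX_RH_AL_P THR A 2 ? ASN A 18 ? THR A 14 ASN A 30
TURN_TY1_P1 TURN_TY1_P GLU A 19 ? GLU A 19 ? GLU A 31 GLU A 31
"""


def parse_struct_conf(text):
    """Naive reader: return the _struct_conf loop as a list of dicts."""
    tags, rows = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "loop_":
            if rows:          # the _struct_conf loop has ended
                break
            tags = []         # a new loop starts; forget earlier tags
        elif line.startswith("_struct_conf."):
            tags.append(line.split()[0].split(".", 1)[1])
        elif line.startswith(("_", "#")):
            if tags:          # another category or a separator follows
                break
        elif tags:
            values = line.split()
            if len(values) == len(tags):  # skip lines that do not fit
                rows.append(dict(zip(tags, values)))
    return rows


def segments(rows):
    """Return (type, chain, begin, end) using the author numbering."""
    return [
        (r["conf_type_id"], r["beg_auth_asym_id"],
         int(r["beg_auth_seq_id"]), int(r["end_auth_seq_id"]))
        for r in rows
    ]


for seg in segments(parse_struct_conf(SAMPLE)):
    print(seg)
```

The sketch uses the auth fields, which follow the numbering of the original PDB entry; the label fields could be used in the same way if the mmCIF numbering is preferred.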

drlemmus commented 6 months ago

There are decent mmCIF parsers for Python. This is a list of known mmCIF parsers and tools: https://mmcif.wwpdb.org/docs/software-resources.html. Note that BioPython is not great but https://pdbeurope.github.io/pdbecif/ is the thing my students use.

Sherman-1 commented 6 months ago

Thank you very much for your help !

Have a great day, Simon

PDB-REDO / dssp

Fill empty columns #80
