BSDExabio / structural_DLFA

Deep learning and structure based hypothesis generation for functional annotation

Add a parsing function for the TMalign results that were not created/formatted by APoc. #76

Closed rbdavid closed 1 year ago

rbdavid commented 1 year ago

We currently have parsing functions for results from APoc. However, a dask workflow has been implemented to run TMalign directly, without the APoc wrapper. Results gathered from TMalign are formatted differently from those of APoc, so new parsing functions need to be written.

rbdavid commented 1 year ago

Currently, the TMalign dask workflow outputs a CSV file of all alignment results as well as a CSV file of the ranked set of alignment results. These results files include all relevant metrics for each individual alignment analysis, in this order:

  1. the path to the target file,
  2. TMscore normalized by mobile structure's nResidues (TMscore1),
  3. TMscore normalized by target structure's nResidues (TMscore2),
  4. RMSD of the aligned residues,
  5. Sequence Identity (abbreviated to SeqID) normalized by the target structure's nResidues (SeqID1),
  6. SeqID normalized by the mobile structure's nResidues (SeqID2),
  7. SeqID of the aligned residues normalized by the nResidues aligned (SeqIDali),
  8. nResidues of the mobile structure (Len1),
  9. nResidues of the target structure (Len2),
  10. and nResidues that were aligned by TMalign (LenAligned).
  11. OPTIONAL, the maximum TMscore value between TMscore1 and TMscore2 (maxTMscore). This column is only present if the maxTMscore is used to rank the full set of TMalign results.

This differs from APoc, which writes only the top 999 results to file (ranked by the TMscore1 value); no results associated with the TMscore2 metric are saved.

To account for these differences in file format, I am creating a separate tmalign_parser.py module file that will hold the functions needed to parse and return a pandas dataframe associated with the results.
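A minimal sketch of what such a parsing function might look like, for illustration only. It assumes the CSV has no header row and that the columns appear in the order listed above; the function name `parse_tmalign_results`, the column labels, and the example rows are all hypothetical and are not taken from the actual `tmalign_parser.py` module.

```python
import io

import pandas as pd

# Column labels follow the order listed in this issue. The labels
# themselves are an assumption, not the module's real names.
BASE_COLUMNS = [
    "Target", "TMscore1", "TMscore2", "RMSD",
    "SeqID1", "SeqID2", "SeqIDali",
    "Len1", "Len2", "LenAligned",
]
OPTIONAL_COLUMN = "maxTMscore"


def parse_tmalign_results(csv_source):
    """Read a header-less TMalign results CSV into a pandas DataFrame.

    Accepts either the 10-column layout or the 11-column layout that
    carries the optional maxTMscore column.
    """
    df = pd.read_csv(csv_source, header=None)
    n_cols = df.shape[1]
    if n_cols == len(BASE_COLUMNS):
        df.columns = BASE_COLUMNS
    elif n_cols == len(BASE_COLUMNS) + 1:
        df.columns = BASE_COLUMNS + [OPTIONAL_COLUMN]
    else:
        raise ValueError(f"unexpected number of columns: {n_cols}")
    return df


# Made-up rows, for illustration only.
example = io.StringIO(
    "/path/to/a.pdb,0.81,0.62,2.1,0.30,0.40,0.45,120,160,110\n"
    "/path/to/b.pdb,0.40,0.55,3.4,0.10,0.12,0.20,120,90,70\n"
)
results = parse_tmalign_results(example)

# Derive maxTMscore when the file did not carry it, then rank on it.
if OPTIONAL_COLUMN not in results.columns:
    results[OPTIONAL_COLUMN] = results[["TMscore1", "TMscore2"]].max(axis=1)
ranked = results.sort_values(OPTIONAL_COLUMN, ascending=False).reset_index(drop=True)
```

Assigning the column names after reading, rather than passing `names=` to `read_csv`, lets one function handle both the 10-column and 11-column files without knowing in advance which ranking metric was used.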

rbdavid commented 1 year ago

The above code has been implemented. Waiting on Alpine to be brought back to active status before I can push those changes. Grumble grumble grumble.

rbdavid commented 1 year ago

Need to implement a function that parses the alignment log file written out by TMalign/USalign. (Oh yeah, we'll be switching to USalign from here on out.) In USalign, the log files for circular permutation, semi-sequence-independent, and fully sequence-independent alignments now output the residue mapping between the aligned structures (since these methods allow for alignments that do not follow linear sequences). This mapping is extremely important for recreating the alignment, as well as for correlating/transferring important residues in the alignment target (a PDB structure with potential metadata) to residues in the mobile structure (the predicted model of a protein).
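To illustrate why the mapping matters, here is a rough sketch of the transfer step once a mapping is in hand. It deliberately makes no assumption about the USalign log format: the mapping is taken as an already-parsed dictionary of target residue number to mobile residue number. The function name `transfer_annotations`, the residue numbers, and the annotation labels are hypothetical, and chain identifiers are ignored for simplicity.

```python
def transfer_annotations(residue_map, target_annotations):
    """Carry target-residue annotations over to aligned mobile residues.

    residue_map: dict of target residue number -> mobile residue number
    target_annotations: dict of target residue number -> annotation label
    Returns (transferred, unmapped): the annotations keyed by mobile
    residue number, and the target residues with no aligned partner.
    """
    transferred = {}
    unmapped = []
    for target_resid, label in target_annotations.items():
        mobile_resid = residue_map.get(target_resid)
        if mobile_resid is None:
            unmapped.append(target_resid)
        else:
            transferred[mobile_resid] = label
    return transferred, unmapped


# Made-up, non-linear mapping of the kind a circular-permutation
# alignment can produce: target order does not match mobile order.
residue_map = {10: 95, 11: 96, 90: 3}
target_annotations = {10: "catalytic", 90: "binding", 50: "metal"}

transferred, unmapped = transfer_annotations(residue_map, target_annotations)
# transferred == {95: "catalytic", 3: "binding"}; unmapped == [50]
```

Because the lookup goes through an explicit mapping rather than a sequence offset, it works the same way whether or not the alignment follows linear sequence order.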

rbdavid commented 1 year ago

Will create the alignment log file parser in a separate issue/branch.