Closed by rbdavid 1 year ago
Currently, the TMalign dask workflow outputs a CSV file of all alignment results as well as a CSV file of the ranked set of alignment results. These results files include all relevant metrics for each individual alignment analysis, in this order:
This is different from APoc, which writes only the top 999 results to file (ranked by the TMscore1 value); no results associated with the TMscore2 metric are saved.
To account for these differences in file format, I am creating a separate tmalign_parser.py module that will hold the functions needed to parse the results and return them as a pandas dataframe.
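For reference, a minimal sketch of what such a parsing function could look like. This is not the code in tmalign_parser.py: the function name `parse_tmalign_results`, the `Target` column, and the example rows are hypothetical, and only the TMscore1/TMscore2 column names are taken from the discussion above.

```python
import io

import pandas as pd


def parse_tmalign_results(csv_source, rank_by="TMscore1", top_n=None):
    """Read an alignment-results CSV and return a ranked pandas dataframe.

    csv_source: path or file-like object (hypothetical layout, see below).
    rank_by:    metric column to sort on, highest value first.
    top_n:      optionally keep only the best top_n rows.
    """
    df = pd.read_csv(csv_source)
    if rank_by not in df.columns:
        raise KeyError(f"column {rank_by!r} not found in results file")
    # Higher TM-scores indicate better alignments, so sort descending.
    ranked = df.sort_values(rank_by, ascending=False).reset_index(drop=True)
    if top_n is not None:
        ranked = ranked.head(top_n)
    return ranked


# Made-up rows for illustration only; the real files hold more columns.
example_csv = io.StringIO(
    "Target,TMscore1,TMscore2\n"
    "target_a,0.61,0.58\n"
    "target_b,0.83,0.40\n"
    "target_c,0.47,0.91\n"
)
ranked_by_tm2 = parse_tmalign_results(example_csv, rank_by="TMscore2")
```

Because all rows are kept in the file, the same function can rank on either TMscore1 or TMscore2, which is not possible with the APoc output.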
The above code has been implemented. Waiting on Alpine to be brought back to active status before I can push those changes. Grumble grumble grumble.
Need to implement a function that parses the alignment log file written out by TMalign/USalign. (Oh yeah, we'll be switching to USalign from here on out.) In USalign, the log files for circular permutation, semi-sequence independent, and fully-sequence independent alignments now output the residue mapping between the aligned structures (since these methods allow for alignments that do not follow the linear sequence). This mapping is extremely important for recreating the alignment as well as for transferring important residues in the alignment target (a PDB structure with potential metadata) to the corresponding residues in the mobile structure (the predicted model of a protein).
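To illustrate why the mapping matters, here is a rough sketch of transferring residue annotations from target to mobile once the mapping has been parsed. It does not parse the USalign log itself; it assumes the mapping is already available as (target residue, mobile residue) pairs. The function name, the data layout, and the example values are all hypothetical.

```python
def transfer_annotations(residue_map, target_annotations):
    """Carry target-residue annotations over to the mobile structure.

    residue_map:        iterable of (target_resid, mobile_resid) pairs,
                        assumed to come from an already-parsed alignment log.
    target_annotations: dict of target_resid -> annotation label.
    Returns a dict of mobile_resid -> annotation label; annotated target
    residues with no aligned partner are skipped.
    """
    target_to_mobile = dict(residue_map)
    transferred = {}
    for target_resid, label in target_annotations.items():
        mobile_resid = target_to_mobile.get(target_resid)
        if mobile_resid is not None:
            transferred[mobile_resid] = label
    return transferred


# Made-up mapping that does not follow linear sequence order, as can
# happen with circular permutation or sequence-independent alignments.
residue_map = [(10, 45), (11, 46), (12, 3), (13, 4)]
target_annotations = {11: "active site", 13: "metal binding", 99: "unaligned"}
mobile_annotations = transfer_annotations(residue_map, target_annotations)
```

A plain dictionary lookup is enough here because the mapping is explicit; nothing has to be inferred from sequence position.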
Will create the alignment log file parser in a separate issue/branch.
We currently have parsing functions for results from APoc. However, a dask workflow has been implemented to run TMalign directly, without the APoc wrapper. Results gathered from TMalign are formatted differently than those from APoc, so new parsing functions need to be written.