NaegleLab / CoDIAC

Unmodeled residues in the structSeq not being recorded as gaps in PDB metafile #31

Closed alekhyaa2 closed 1 year ago

alekhyaa2 commented 1 year ago

Is your feature request related to a problem? Please describe.

The unmodeled list for 3CD3 is 'unmodeled_list': [486, 488, 496, 500, 507, 510, 516, 519, 538, 540, 594, 597, 822, 903]. This list is obtained from the contactMap class, but during PDB integration of the reference and structure sequences, these positions are not reported as gaps in the PDB metafile.

The sequence alignment below shows the positions of the unmodeled regions, and these match the ones recorded in the unmodeled list.
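
As a reading aid, the flat list can be grouped into (start, end) ranges. This is only a sketch: it assumes the list alternates inclusive start and end positions, which the thread does not confirm, and `pair_up` is a hypothetical helper, not a CoDIAC function.

```python
def pair_up(flat):
    """Group a flat list [s1, e1, s2, e2, ...] into (start, end) tuples.

    Assumes the list alternates inclusive start and end positions.
    This is an interpretation of the list, not confirmed CoDIAC behavior.
    """
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of positions")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


# The list reported above for 3CD3.
unmodeled_list = [486, 488, 496, 500, 507, 510, 516, 519,
                  538, 540, 594, 597, 822, 903]
unmodeled_ranges = pair_up(unmodeled_list)
```

Under that assumption, the first range would be 486 to 488.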

[Image: Screen Shot 2023-09-05 at 9 03 37 PM]

knaegle commented 1 year ago

@alekhyaa2 can you please clarify this issue? We used to account for unmodeled regions because we had to construct the structure sequence from the PDB files. Now we get the structure sequence from PDB and no longer need to reverse engineer it. Also, the integration of the reference sequence does not look at the structure files directly. Are you suggesting we need to add a new layer that accounts for and adds unmodeled regions?

As a note, we still have a gaps issue. This occurs when what was experimentally tested has missing or additional amino acids relative to the reference sequence being used. We saw this when we looked at some examples.

alekhyaa2 commented 1 year ago

I am guessing that residues missing from the experimental structure sequence but present in the reference sequence are treated as gaps. For example, when we align the structure sequence, the UniProt reference sequence, and the canonical sequence (ref_struct_seq), we observe that residues 486 to 488 are not in the PDB structure 3CD3. So we refer to these as gaps, right? If so, we do not see this reflected in our PDB metafile.

Here is a snapshot of the sequence alignment with the position numbers on top; these were missing from the image I uploaded earlier.

[Image: Screen Shot 2023-09-07 at 2 12 36 PM]

alekhyaa2 commented 1 year ago

The unmodeled residues shown in the PDB web interface differ from those we report in our contactmap class analysis. The gaps in our code refer to deletions/insertions in the structure sequence. The unmodeled residues are those that were not determined experimentally. We should use refseq instead of structseq when printing fasta files. If we are retrieving the correct refseq sequence, then we should not have these gaps. We don't see this issue anymore with all the edits made to the contactmap and pdb integration modules.
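
To make the distinction concrete, here is a minimal sketch of how gaps in the sense used here (deletions/insertions in the structure sequence relative to the reference) could be read off a pairwise alignment. It is not the CoDIAC implementation: `find_gaps` is a hypothetical helper, the inputs are assumed to be two already-aligned strings with '-' as the gap character, and the sequences in the example are toy strings, not 3CD3. Unmodeled residues would not show up this way, because they are still part of the structure sequence.

```python
def find_gaps(aligned_ref, aligned_struct, ref_start=1):
    """Return (deletions, insertions) from two equal-length aligned strings.

    deletions: inclusive (start, end) reference positions that are present
        in the reference but aligned to '-' in the structure sequence.
    insertions: reference positions after which the structure sequence
        carries extra residues (the reference shows '-').
    """
    if len(aligned_ref) != len(aligned_struct):
        raise ValueError("aligned sequences must have equal length")
    deletions, insertions = [], []
    pos = ref_start - 1  # last reference position consumed so far
    run_start = None     # start of the deletion run currently open
    for r, s in zip(aligned_ref, aligned_struct):
        if r != "-":
            pos += 1
        if r != "-" and s == "-":
            # Reference residue with no structure counterpart: a deletion.
            if run_start is None:
                run_start = pos
        else:
            if run_start is not None:
                # The run ended on the previous reference position.
                deletions.append((run_start, pos - 1 if r != "-" else pos))
                run_start = None
            if r == "-" and s != "-":
                # Extra structure residue: record each insertion site once.
                if not insertions or insertions[-1] != pos:
                    insertions.append(pos)
    if run_start is not None:
        deletions.append((run_start, pos))
    return deletions, insertions


# Toy example: positions 3 to 5 of the reference are deleted in the structure.
deletions, insertions = find_gaps("MKTAYIAK", "MK---IAK")
```

Passing `ref_start` lets the reported positions follow the reference numbering when the alignment does not begin at residue 1.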