NaegleLab / CoDIAC


Inconsistent mutation data recorded in different columns of PDB Metadata file #27

Closed alekhyaa2 closed 1 year ago

alekhyaa2 commented 1 year ago

Is your feature request related to a problem? Please describe. The mutations fetched by PDB.py and IntegrateStructure_Reference.py differ. I identified several issues; examples of each are highlighted in the snapshot below.

  1. "MUTATIONS/MODS (Y/N)" says that there is a mutation (Y), but "MUTATIONS (LOCATION)" doesn't report which mutation is present.
  2. Mutations reported in the uppercase columns do not match the data in the "mutations" (lowercase) columns.
[Screenshot, 2023-09-04: highlighted examples from the PDB metadata file]

These differences arise mainly because the two sets of columns fetch mutation data in different ways. The uppercase columns retrieve mutation positions directly from the PDB database. The lowercase columns use the "alignmentTools.findDifferencesBetweenPairs" function to find differences between the structure and reference sequences, and then report those differences as mutations.
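To illustrate the alignment-based approach, here is a minimal sketch of reporting differences between two sequences as mutations. It is not the actual "alignmentTools.findDifferencesBetweenPairs" implementation: the function name `find_differences`, its parameters, the assumption that the two sequences are already aligned to equal length with "-" for gaps, and the `T3P`-style labels are all hypothetical.

```python
def find_differences(ref_aligned, struct_aligned, ref_start=1):
    """Report substitutions between two pre-aligned sequences.

    Hypothetical helper: assumes both strings have equal length and use
    '-' for gaps. Positions are numbered along the reference sequence.
    """
    if len(ref_aligned) != len(struct_aligned):
        raise ValueError("sequences must be aligned to the same length")

    variants = []
    ref_pos = ref_start - 1
    for ref_res, struct_res in zip(ref_aligned, struct_aligned):
        if ref_res != "-":
            ref_pos += 1  # only reference residues advance the numbering
        if "-" in (ref_res, struct_res):
            continue  # insertions and deletions are not reported here
        if ref_res != struct_res:
            variants.append(f"{ref_res}{ref_pos}{struct_res}")
    return variants


# Toy sequences, not taken from any PDB entry.
print(find_differences("MKTAY", "MKPAY"))
```

Because this kind of comparison depends only on the two sequences, it can report a substitution even when the PDB annotation record lists none, and it can report nothing when no reference sequence is available.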

For PDB 4JGH, entity 4, two mutations are reported in "MUTATIONS (LOCATION)" but not in the "mutations" column. For PDB 2C9W, entity 1, there is one mutation that, surprisingly, the PDB database doesn't report, but it was captured in the "mutations" column, which compares the reference and structure sequences.
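Discrepancies like these could be flagged automatically. The sketch below checks one metadata row at a time, using the column names quoted in this issue. The helper names, and the assumption that both mutation columns hold comma-separated labels in the same format and numbering, are hypothetical; the real file may need a conversion step first.

```python
def parse_cell(cell):
    """Split a comma-separated cell into a set of labels (blank gives an empty set)."""
    return {item.strip() for item in (cell or "").split(",") if item.strip()}


def audit_row(row,
              flag_col="MUTATIONS/MODS (Y/N)",
              pdb_col="MUTATIONS (LOCATION)",
              aln_col="mutations"):
    """Return a list of human-readable inconsistencies for one row (a dict)."""
    issues = []
    flagged = (row.get(flag_col) or "").strip().upper() == "Y"
    pdb_muts = parse_cell(row.get(pdb_col))
    aln_muts = parse_cell(row.get(aln_col))

    # Issue 1: the flag says Y but no mutation is listed.
    if flagged and not pdb_muts:
        issues.append("flag is Y but no PDB mutation is listed")

    # Issue 2: the two sources disagree.
    only_pdb = sorted(pdb_muts - aln_muts)
    only_aln = sorted(aln_muts - pdb_muts)
    if only_pdb:
        issues.append("only in PDB column: " + ", ".join(only_pdb))
    if only_aln:
        issues.append("only in alignment column: " + ", ".join(only_aln))
    return issues


# Made-up row for illustration; not real data from the metadata file.
example = {"MUTATIONS/MODS (Y/N)": "Y", "MUTATIONS (LOCATION)": "", "mutations": "T3P"}
print(audit_row(example))
```

Rows read with `csv.DictReader` could be passed to `audit_row` directly, since each row is already a dict keyed by column name.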


knaegle commented 1 year ago

First clarification: I have renamed 'mutations', the column that comes from integration, to ref variants. See the updated References.md documentation about this.

knaegle commented 1 year ago

Note that in the example PDB 4JGH, entity 4 has no listed mutations. That's because we don't have this entity in the reference, so we can't calculate mutations relative to the reference. The behavior is now clearer: the 'ref:variants' column states N/A (as opposed to -1).

For the other example, 2C9W, I have confirmed that there is a mutation, so the reference variant is correct. However, it is not caught from the PDB side, despite the web interface saying it is and the reference having a P versus M in this position. I suspect this might be due to multiple entities in a PDB?

knaegle commented 1 year ago

Closing this. After rerunning the latest versions, I find that the issue is in the PDB annotation records for mutations, which do not appear to be 100% reliable. However, the CoDIAC alignment-based extraction of mutations relative to our reference is accurate in all the cases where it differs from what PDB returns.

For example, for 2C9W I cannot find the mutations in the CIF file (i.e. they are not there). The web interface agrees with our assessment, suggesting that it's a record issue in the CIF file.
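One way to repeat this kind of check is a plain-text scan of the CIF file for mutation-related data items. The sketch below is an assumption about how one might look, not how the check above was done: it only reports whether lines beginning with the PDBx/mmCIF item `_entity.pdbx_mutation` or the category `_struct_ref_seq_dif` are present. It does not interpret their values, so an item whose value is `?` still counts as present. The function name is hypothetical.

```python
# mmCIF item and category prefixes that can carry mutation annotations.
MUTATION_TAG_PREFIXES = ("_entity.pdbx_mutation", "_struct_ref_seq_dif.")


def mutation_tags_present(cif_text):
    """Return the sorted mutation-related tags found in mmCIF text.

    Hypothetical helper: a line-based scan, not a full mmCIF parser.
    """
    found = set()
    for line in cif_text.splitlines():
        tokens = line.strip().split(maxsplit=1)
        if tokens and tokens[0].startswith(MUTATION_TAG_PREFIXES):
            found.add(tokens[0])
    return sorted(found)


# Tiny synthetic snippet, not copied from a real entry.
snippet = "_entity.id 1\n_entity.pdbx_mutation ?\n"
print(mutation_tags_present(snippet))
```

An empty result would suggest the file carries no mutation annotation at all, whereas a non-empty one would need its values inspected before drawing any conclusion.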