SasView / sasview

Code for the SasView application.
BSD 3-Clause "New" or "Revised" License

PDB loader generates bogus errors #2870

Open butlerpd opened 2 months ago

butlerpd commented 2 months ago

Describe the bug: When loading some PDB files, such as the apoferritin one, a slew of

ERROR: list index out of range

messages is thrown. @krzywon says these are bogus and are caused by something in the faster reading of PDB files. The file eventually loads, can be drawn, and a curve can be generated.

To reproduce:

  1. In the Generic Scattering Calculator, load the apoferritin PDB file (which apparently cannot be attached) or try another large PDB file.
  2. See the errors in the Log Explorer.

Expected behavior: No errors thrown.

krzywon commented 1 month ago

I did more digging on this and my initial guess was wrong. The issue is related to creating atomic connections to lines in files that seem to be individual atoms, but that aren't read in as atoms by the PDB reader.

To reproduce:

  1. Open SasView (verified in v5.0.5 and above).
  2. Open the Generic Scattering Calculator Tool.
  3. Open 1n04.pdb as Nuclear Data.
  4. Observe that 31 "list index out of range" errors are thrown.

More information: In the 1n04.pdb file, after line 5801, the record name on the atom lines changes from ATOM to TER and then to HETATM. The ATOM lines are loaded as atomic points, but the TER and HETATM lines are ignored. When creating atomic connections (CONECT lines), many connections link atoms labeled TER and HETATM, which are not in the list of atoms generated by the PDB reader. This is where the index out of range errors come from.
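The mismatch described above can be checked outside SasView. The following is a minimal sketch, not the SasView reader: the function name `dangling_conect_serials` and its `atom_records` parameter are hypothetical, and the column positions are taken from the fixed-width layout of the PDB format. It lists every serial number that a CONECT line references but that a reader keeping only the given record names would never have stored.

```python
def dangling_conect_serials(lines, atom_records=("ATOM",)):
    """Return serial numbers referenced by CONECT but absent from the kept atoms.

    `atom_records` names the record types a (hypothetical) reader keeps.
    """
    lines = list(lines)

    # First pass: collect the serial numbers of every record the reader keeps.
    kept = set()
    for line in lines:
        if line[:6].strip() in atom_records:
            kept.add(int(line[6:11]))  # serial number, columns 7-11

    # Second pass: check every serial number mentioned on a CONECT line.
    dangling = []
    for line in lines:
        if line[:6].strip() != "CONECT":
            continue
        # Up to five serial numbers in 5-character fields, columns 7-31.
        for start in range(6, 31, 5):
            field = line[start:start + 5].strip()
            if field and int(field) not in kept:
                dangling.append(int(field))
    return dangling
```

Passing the lines of a PDB file with the default `atom_records` mimics a reader that keeps only ATOM records. Passing `("ATOM", "HETATM")` shows how many references would still be unresolved if HETATM records were also kept.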

The real issue: This could be one of two things. The first is malformed data sets: the data sets might not have the correct labels on each line. The second is that the reader does not read all values from the file correctly. The first is harder to handle, and we would need to know what alternate nomenclature is used for atomic lines in PDB files.
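One quick way to tell the two cases apart is to tally the record names that actually occur in a file. This is a minimal sketch; `tally_record_names` is a hypothetical helper that assumes only that the record name occupies the first six columns of each line.

```python
from collections import Counter


def tally_record_names(lines):
    """Count how often each record name (columns 1-6) appears, skipping blank lines."""
    return Counter(line[:6].strip() for line in lines if line.strip())
```

Calling it on an open file object for a problem file shows at a glance whether the lines the reader skips carry record names that the reader simply does not handle, or labels that look malformed.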

To start the diagnosis, we need the PDB file specification to figure out which it is. If anyone has it, please link or attach it here.

krzywon commented 1 month ago

Found it: https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

Reading through this, HETATM records represent "non-standard" chemical coordinates. These will likely need to be read in.
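As a rough illustration of what reading them in could look like, here is a minimal sketch that is not the SasView implementation. `read_atom_records` is a hypothetical name, and the column positions follow the fixed-width coordinate layout shared by ATOM and HETATM records in the specification linked above. Atoms are keyed by serial number rather than by list position, so a CONECT reference can be looked up directly. TER records are left out because the specification defines no coordinate columns for them.

```python
def read_atom_records(lines, records=("ATOM", "HETATM")):
    """Map serial number -> (element, x, y, z) for the chosen record names."""
    atoms = {}
    for line in lines:
        if line[:6].strip() not in records:
            continue
        serial = int(line[6:11])        # columns 7-11
        x = float(line[30:38])          # columns 31-38
        y = float(line[38:46])          # columns 39-46
        z = float(line[46:54])          # columns 47-54
        element = line[76:78].strip()   # columns 77-78, may be blank
        atoms[serial] = (element, x, y, z)
    return atoms
```

With a mapping like this, a connection to a serial number that is missing from the file can be detected with a membership test and skipped or reported once, instead of raising an index error for each reference.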