MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.3k stars 647 forks source link

1 A^3 boxes from EM structures in the PDB #2599

Closed IAlibay closed 4 years ago

IAlibay commented 4 years ago

Context

This is related to this discussion: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/mdnalysis-discussion/IKYv18MqlFs

From a cursory look at a couple of systems (6NT5, 6CVM, 3JAC), it looks like PDB files for EM structures have a tendency to have the following CRYST1 records:

CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1

As dicussed in the dicussion thread, this is causing issues with periodic selections as it is folding the system into a cube of 1 A.

Proposed solution

My solution here would be to catch cases where the CRYST1 record is defining a 1 A cube, warn the user and make the system dimensionless.

Code to reproduce the behavior

Very slightly altered code from Lucas Siemons' post.

import MDAnalysis as mda
import numpy as np

u = mda.Universe('6cvm.pdb')

cur_sele = u.select_atoms('around 0.1 (resid 4 and name CA and segid A)')

ref_ca = u.select_atoms('resid 4 and name CA and segid A')

print('sele len: ', len(cur_sele))
for i in cur_sele:
    dist = np.linalg.norm(i.position-ref_ca[0].position)
    print('Distance: %f '%(dist), i)

Current version of MDAnalysis

richardjgowers commented 4 years ago

Yeah catching these boxes seems like a good idea and can’t really harm anyone (except people with sub 1A boxes).

On Mon, Mar 9, 2020 at 14:39, Irfan Alibay notifications@github.com wrote:

Context

This is related to this discussion: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/mdnalysis-discussion/IKYv18MqlFs

From a cursory look at a couple of systems (6NT5, 6CVM, 3JAC), it looks like PDB files for EM structures have a tendency to have the following CRYST1 records:

CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1

As dicussed in the dicussion thread, this is causing issues with periodic selections as it is folding the system into a cube of 1 A. Proposed solution

My solution here would be to catch cases where the CRYST1 record is defining a 1 A cube, warn the user and make the system dimensionless. Code to reproduce the behavior

Very slightly altered code from Lucas Siemons' post.

import MDAnalysis as mdaimport numpy as np

u = mda.Universe('6cvm.pdb')

cur_sele = u.select_atoms('around 0.1 (resid 4 and name CA and segid A)')

ref_ca = u.select_atoms('resid 4 and name CA and segid A') print('sele len: ', len(cur_sele))for i in cur_sele: dist = np.linalg.norm(i.position-ref_ca[0].position) print('Distance: %f '%(dist), i)

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.version)") 0.20.2-dev0
  • Which version of Python (python -V)? 3.7.6
  • Which operating system? Ubuntu 18.04.3 LTS

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2599?email_source=notifications&email_token=ACGSGB4NWPD7HLCGCI2RWPDRGT5SXA5CNFSM4LEKGWN2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4ITS5R4Q, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB4BNB6JCEEE7GMAPOTRGT5SXANCNFSM4LEKGWNQ .

IAlibay commented 4 years ago

@siddharthjain1611 this issue is indeed currently free to work on if you wish to do so. I would however suggest possibly not over-extending yourself by looking to fix multiple issues concurrently (namely #2469), and instead focusing on one so that you can get a PR merged before the GSOC application window opens. Doing so means that it also doesn't block off issues from other prospective GSOC applicants.

Edit: looks like the comment this was aimed at has now been deleted.

siddharthjain1611 commented 4 years ago

Sir, I would first focus on the other issue, therefore, please feel free to assign someone else to this issue who is more worthy.

orbeckst commented 4 years ago

Note:

REMARK 247 ELECTRON MICROSCOPY
REMARK 247  THE COORDINATES IN THIS ENTRY WERE GENERATED FROM ELECTRON
REMARK 247  MICROSCOPY DATA. PROTEIN DATA BANK CONVENTIONS REQUIRE
REMARK 247  THAT CRYST1 AND SCALE RECORDS BE INCLUDED, BUT THE VALUES
REMARK 247  ON THESE RECORDS ARE MEANINGLESS EXCEPT FOR THE CALCULATION
REMARK 247  OF THE STRUCTURE FACTORS.

Technically, it’s the users responsibility to provide correct input.

In practice, this can easily lead to very wrong results when working with PDB files.

I really dislike having to put a lot of special casing in readers but if

CRYST1    1.000    1.000    1.000  90.00  90.00  90.00 P 1

is a common thing in cryo-EM structures (which will soon make up a significant portion of the PDB) then we can’t just slip them through. Perhaps we should have a way to set the box to None (or whatever we use for no box) when we encounter either these magic box size or if we have another clear indication that the box is non-sensical.

TODO

richardjgowers commented 4 years ago

So the codes pasted above give no dimensions info when using fetch_mmtf so it looks like it's a strictly PDB problem at least.

We could try and do some sort of string match in the remarks if we really wanted, but I just think it's fine to warn and delete the box info.

IAlibay commented 4 years ago

So to tackle the first "TODO" from @orbeckst's post, I had a quick thrawl through the RSCB PDB to investigate the frequency of cryo-EM PDB CRYST1 records matching:

CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1

Out of 2286 entries matching the a text search for "cryo-em", 2109 (~92%) matched the above CRYST1 record.

I've not had a thorough look at the cases where the CRYST1 record was different, but in ~ dozen cases it was a case that the CRYST1 record was partially empty in some way, in ~ 50 cases there were no CRYST1 records at all, in the remainder of cases some kind of non-1 A^3 box was defined.

orbeckst commented 4 years ago

I really don't want to start digging through REMARKS.

Let's define what we mean by "a sensible box", document, throw it out & warn by default, and give user an option to keep whatever rubbish is in the file if they really want to.

0xethsign commented 4 years ago

is someone currently working on this issue?

orbeckst commented 4 years ago

@smrpn, as far as I can see there's no PR linked to this issue at the moment so by all means go ahead and put up a PR.

GomaruSai commented 4 years ago

Hello! I would like to know if someone is working on this issue?

0xethsign commented 4 years ago

@GomaruSai i'm not

IAlibay commented 4 years ago

@GomaruSai as far as I'm aware no one has opened a PR on this, please do go ahead and open one if you'd like to.

Oscuro-Phoenix commented 4 years ago

Hi! I'm trying this out.

If you trace the issue back to groups.py in the core module, it seems like unless the periodic key in the dictionary is not specified as periodic=False (during select_atoms function call) it would take it as True by default. Could you (@IAlibay) kindly clarify this ?

IAlibay commented 4 years ago

I was under the impression that @GomaruSai wanted to work on this issue.

If you trace the issue back to groups.py in the core module, it seems like unless the periodic key in the dictionary is not specified as periodic=False (during select_atoms function call) it would take it as True by default. Could you (@IAlibay) kindly clarify this ?

I'm not sure I follow you fully here, but yes this is happening because periodicity is applied by default if a CRYST record exist in the PDB. That being said, we wouldn't want to change the default value of the periodicity flag.

Instead we just want to pick up a case where an 1 A^3 box is defined as a possible EM structure that was given a default non-real CRYST record. On identifying this, we'd then just warn the user, and not set the box dimensions.

Oscuro-Phoenix commented 4 years ago

I'm kind-of struggling to find the genesis of this default behavior on encountering CRYST. I found that the PDB.py has a section that reads CRYST1 but it doesn't explain the behavior. Any hints on what I might be missing?

IAlibay commented 4 years ago

I'm kind-of struggling to find the genesis of this default behavior on encountering CRYST. I found that the PDB.py has a section that reads CRYST1 but it doesn't explain the behavior. Any hints on what I might be missing?

So the section you're probably looking for is: https://github.com/MDAnalysis/mdanalysis/blob/5f513831301492ffcf054c7b57ee6092bf153c41/package/MDAnalysis/coordinates/PDB.py#L391-L400 What we want to do is identify where the unit cell values being fed here define a 1 A^3 box, i.e. when the CRYST1 record is: CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1.