BioJulia / BioStructures.jl

A Julia package to read, write and manipulate macromolecular structures
Other
93 stars 22 forks source link

Request for symmetries #15

Open abahm opened 4 years ago

abahm commented 4 years ago

This is a great library and I'm already using it lots. I would like to evaluate positions of a protein's atoms with the full expression of symmetry in the complete unit cell or entire crystal. I can't seem to find it in the documentation for this library, nor in the code.

Expected Behavior

I would like to see documented examples of how to retrieve that atoms positions along with all the symmetries of those positions.

Current Behavior

I've written my own basic parser (below), but it only works for PDB files. Molecules in the PDB can often be dowloaded in either PDB format or MMCIF format, however some appear to only be downloadable as mmCIF, e.g. "6PEM". In these cases, downloadpdb throws an exception unless the MMCIF format is specified.

julia> BioStructures.downloadpdb("6PEM")   # no .pdb file available!
[ Info: Downloading PDB: 6PEM
┌ Error: Download failed: curl: (22) The requested URL returned error: 404 Not Found
└ @ Base download.jl:43
ERROR: failed process: Process(`/usr/bin/curl -s -S -g -L -f -o /var/folders/ry/wg4yqb1j5lz3lcyzyls96k2r0000gn/T/jl_32sXTz http://files.rcsb.org/download/6PEM.pdb.gz`, ProcessExited(22)) [22]
julia> BioStructures.downloadpdb("6PEM", file_format=MMCIF)   # works!

Possible Solution / Implementation

Perhaps two functions be added to the library. One function that can inform the user which filetypes a protein can be downloaded in, and another function that allows the user to read symmetries regardless of the file type. However there may be more natural solutions that integrate better with the BioStructures.Model (e.g. adding a field for parsed symmetries). I would be happy with anything that supported both formats and allowed me to ultimately get all the atoms in the unit cell or entire crystal.

By way of example of the second function, the code that I implemented as a workaround follows. It assumes that you have already downloaded a PDB file to a pdb_cache_directory, and it only works with .pdb format files.

function read_symmetries(protein_name)
    # get the symmetries of the complex - rotation matrix and translation vector
    # they are in the form:
    #                     index         rotation M               translation
    # REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
    # REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
    # REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
    #
    full_protien_name = protein_name * ".pdb"
    fn = joinpath(pdb_cache_dir, full_protien_name)
    lines = readlines(fn)
    remark = "REMARK 350   BIOMT"
    lines_defining_symmetry = filter(x->startswith(x,remark), lines)
    @assert (length(lines_defining_symmetry) % 3 == 0) "odd number of lines in symmetry section"
    nsymmetries = round(Int64, length(lines_defining_symmetry)/3)
    rotation_matries = Array{Float64,2}[]
    translation_vectors = Array{Float64,1}[]
    for i in 1:nsymmetries
        idx = (i-1)*3
        s1 = split(lines_defining_symmetry[idx+1])
        s2 = split(lines_defining_symmetry[idx+2])
        s3 = split(lines_defining_symmetry[idx+3])
        rotation_matrix = [ parse(Float64, s1[5]) parse(Float64, s1[6]) parse(Float64, s1[7]);
                            parse(Float64, s2[5]) parse(Float64, s2[6]) parse(Float64, s2[7]);
                            parse(Float64, s3[5]) parse(Float64, s3[6]) parse(Float64, s3[7]) ]
        translation_vector = [ parse(Float64, s1[8]), parse(Float64, s2[8]), parse(Float64, s3[8]) ]
        push!(rotation_matries, rotation_matrix)
        push!(translation_vectors, translation_vector)
    end
    rotation_matries, translation_vectors
end

Your Environment

jgreener64 commented 4 years ago

This is a great library and I'm already using it lots.

Thanks for the kind words - do let me know of any other issues you run into or functionality you would like.

One function that can inform the user which filetypes a protein can be downloaded in

I thought about this, but am not sure how rigid/fixed this information is. I could add a better error message to downloadpdb though, suggesting to try downloading the mmCIF file if the PDB download fails. And I could mention this in the docs, saying that the only way to guarantee the file is available is to use mmCIF. Those are on the to-do list.

function that allows the user to read symmetries regardless of the file type

You are correct in saying that PDB headers contain the symmetry information, but this information is also in the mmCIF dictionary. See for example https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies#Anchor-Biol. You could get the values using something like:

using BioStructures
downloadpdb("3C70", file_format=MMCIF)
d = MMCIFDict("3C70.cif")
d["_pdbx_struct_oper_list.matrix[1][1]"] # etc

If you want your parser to work on all PDB entries that have that information, you could parse from the mmCIF dictionary. The primacy of the mmCIF file is one reason BioStructures doesn't do any PDB header parsing.

I would like to evaluate positions of a protein's atoms with the full expression of symmetry in the complete unit cell or entire crystal.

Thankfully, if you update to BioStructures v0.8.0 you will be able to use recently-added transformation functions to help with this. See the docstrings for Transformation for more. For example you could read the translation and rotation information from the mmCIF dictionary, create a Transformation from these then run applytransform! on your model or applytransform on the coordinate array to get updated coordinates.

If you do come up with something that works, let me know and I can add an example/explanation to the documentation.

abahm commented 4 years ago

Thank you for all the information! I'll update and begin working with the mmCIF data, and have a look at the transformation functionality. Thanks again for a great library!