New data structure for biomolecular structures

bicycle1885 commented 7 years ago

This is a proposal to define a new data structure for biomolecular structures (e.g. proteins). Here is a prototype I wrote during today's JuliaCon Hackathon: https://gist.github.com/bicycle1885/768298687e00aef2826a7a2ad9fa129d.

Basically, it has three data types:

Atom: xyz coordinate type of an atom with its ID number.
AtomCoords: a matrix type of natoms x {x,y,z}
Structure: a biomolecular structure storing atom coordinates, residues, and chains with metadata

Atom has an ID that is unique within a structure, plus xyz coordinates stored as 32-bit floating-point numbers. That is 16 bytes in total, a multiple of 64 bits, so the structs should pack contiguously in memory on most modern computers. AtomCoords is a collection of these Atom structs. I think this is efficient and handy because coordinates are stored densely in memory (fewer cache misses!) and it behaves like an ordinary matrix of size natoms x 3. Finally, Structure holds the coordinates and other information (e.g. atom name, occupancy, etc.). A Structure object shares its data with other structures, as BioSequence does, so we can quickly create a new substructure (e.g. a chain or residue) from a large structure. Structure is also extensible: it has type-parameterized metadata fields for each atom, residue, and chain. I currently use NamedTuples.jl to hold metadata, but named tuples will be incorporated into the base library in the future (https://github.com/JuliaLang/julia/pull/22194).
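To make the layout concrete, here is a minimal sketch in Python (standard library only). It is not the Julia prototype from the gist. It assumes the ID is a 32-bit integer, which is what the 16-byte total implies alongside three 32-bit floats, and the names `ATOM`, `from_atoms`, `atom_id`, and `view` are hypothetical. It shows a dense buffer of packed records that can be indexed like an natoms x 3 matrix, and a substructure that shares memory with its parent instead of copying.

```python
import struct

# Assumed record layout: 32-bit integer ID, then x, y, z as 32-bit floats.
ATOM = struct.Struct("<ifff")  # 4 + 3 * 4 = 16 bytes per atom


class AtomCoords:
    """Dense buffer of packed atom records, indexable like an natoms x 3 matrix."""

    def __init__(self, buffer):
        self._buf = memoryview(buffer)

    @classmethod
    def from_atoms(cls, atoms):
        # atoms is a sequence of (id, x, y, z) tuples.
        buf = bytearray(ATOM.size * len(atoms))
        for i, (atom_id, x, y, z) in enumerate(atoms):
            ATOM.pack_into(buf, i * ATOM.size, atom_id, x, y, z)
        return cls(buf)

    def __len__(self):
        return len(self._buf) // ATOM.size

    def atom_id(self, i):
        return ATOM.unpack_from(self._buf, i * ATOM.size)[0]

    def __getitem__(self, index):
        i, j = index  # row i is the atom, column j is 0, 1, 2 for x, y, z
        return ATOM.unpack_from(self._buf, i * ATOM.size)[1 + j]

    def __setitem__(self, index, value):
        i, j = index
        # Skip the 4-byte ID, then j floats, to reach the requested coordinate.
        struct.pack_into("<f", self._buf, i * ATOM.size + 4 * (1 + j), value)

    def view(self, start, stop):
        """Substructure over atoms [start, stop) that shares the parent's memory."""
        return AtomCoords(self._buf[start * ATOM.size : stop * ATOM.size])


coords = AtomCoords.from_atoms(
    [(1, 0.0, 0.0, 0.0), (2, 1.5, 0.0, 0.0), (3, 1.5, 2.0, 0.0)]
)
sub = coords.view(1, 3)  # atoms 2 and 3, no copy
sub[0, 0] = 9.0          # writes through to the parent buffer
```

Because `view` slices a `memoryview`, creating a substructure costs the same regardless of how many atoms it covers, which is the property the shared-data design is after.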

A problem with our current data types for biomolecular structures is that they are tightly bound to the PDB file format. Since the data structure I propose is more generic, I think we can support more file formats, such as mmCIF or MMTF.

I still need to find a way to handle disordered atoms and other gory details. But I think it would be better to show you my ideas and have a discussion before going into the details.

TransGirlCodes commented 7 years ago

@jgreener64 as the original author of the Struct code, and someone working on protein structure, what do you think about these design ideas?

jgreener64 commented 7 years ago

Yes, I can see the advantage of this type of data structure.

My original code was written with the PDB format in mind and is largely based on that (and Biopython's Bio.PDB module) as I find it to represent the most popular file format by far. It spends a lot of effort dealing with the gory details of the format, particularly disorder, and the API is aimed at day-to-day (rather than computationally-intensive) structural bioinformatics tasks, though it benchmarks well due to Julia being Julia.

This proposed data structure would allow for efficient computations on coordinates etc and gives a more abstract type of molecule representation. Proper consideration should be given to the gory details though.

My understanding is that mmCIF/MMTF could be parsed without ambiguity into the current data structure. I'm not saying it's the best structure for them, but I think it can be done.

I have actually done a bit of work on a Julia MMTF encoder/decoder, which I haven't looked at in a while but might dig out.

I think a question we have to ask ourselves is "What are people going to use the Bio.Structure module for?". The data structure should be highly efficient, that's part of the BioJulia ethos, but equally important is that the data structure is useful to people's needs.

bicycle1885 commented 7 years ago

> My understanding is that mmCIF/MMTF could be parsed without ambiguity into the current data structure. I'm not saying it's the best structure for them, but I think it can be done.

I'm not entirely sure, but mmCIF and MMTF are extensible (i.e. we can add more fields to each atom) while the current data structure is not. I think this will be a problem when we want to read data files that carry extra metadata, or when we want to attach metadata to in-memory structures.
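As an illustration of what attaching extra per-atom fields could look like, here is a small Python sketch using only the standard library. It is an assumption-laden stand-in, not the type-parameterized Julia design: the `Structure` class, its `atom_meta` column dictionary, and the `add_atom_field` and `atom` helpers are all hypothetical names. Metadata is stored column-wise, so a new field can be added without changing the layout of the coordinates.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    coords: list                                   # one (x, y, z) tuple per atom
    atom_meta: dict = field(default_factory=dict)  # field name -> per-atom column

    def add_atom_field(self, name, values):
        """Attach a new per-atom column; it must have one value per atom."""
        values = list(values)
        if len(values) != len(self.coords):
            raise ValueError(
                f"{name!r} needs {len(self.coords)} values, got {len(values)}"
            )
        self.atom_meta[name] = values

    def atom(self, i):
        """Return coordinates plus every attached metadata field for atom i."""
        record = {"xyz": self.coords[i]}
        for name, column in self.atom_meta.items():
            record[name] = column[i]
        return record


s = Structure(coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)])
s.add_atom_field("name", ["N", "CA"])
s.add_atom_field("occupancy", [1.0, 0.5])
# A field that a fixed PDB-shaped record would have no slot for (hypothetical):
s.add_atom_field("partial_charge", [-0.3, 0.1])
```

A reader for an extensible format could call something like `add_atom_field` once per extra column it encounters, while code that only needs coordinates never touches the metadata.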

> I think a question we have to ask ourselves is "What are people going to use the Bio.Structure module for?". The data structure should be highly efficient, that's part of the BioJulia ethos, but equally important is that the data structure is useful to people's needs.

We really care about this. Our code must be efficient and easy to use. I believe my proposal is more efficient in most cases because of its cache efficiency and lower memory footprint. Also, I'm going to make it more like a dataframe, which would be familiar to those who use R or Python's pandas. That is, we can apply both CPU/RAM-intensive algorithms and dataframe-like operations to structures. For example, some people in my lab do MD and have lots of snapshots of a trajectory. In that case, we may want to use dataframe-like operations to select regions of interest and then apply CPU-intensive algorithms to them to get insights.
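The select-then-compute workflow might look roughly like the following Python sketch (standard library only, Python 3.8+ for `math.dist`). The column-oriented `atoms` table, the `select` helper, and the choice of radius of gyration as the "CPU-intensive" step are all illustrative assumptions, not part of the proposal.

```python
import math

# A toy column-oriented atom table (hypothetical data).
atoms = {
    "chain": ["A", "A", "B", "B"],
    "name": ["CA", "CB", "CA", "CB"],
    "xyz": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 6.0)],
}


def select(table, predicate):
    """Dataframe-like filter: keep the rows for which predicate(row) is true."""
    n = len(table["xyz"])
    keep = [
        i for i in range(n)
        if predicate({key: column[i] for key, column in table.items()})
    ]
    return {key: [column[i] for i in keep] for key, column in table.items()}


def centroid(points):
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))


def radius_of_gyration(points):
    """Root-mean-square distance of the points from their centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))


# Select a region of interest, then run the numeric step on it.
chain_a = select(atoms, lambda row: row["chain"] == "A")
rg_a = radius_of_gyration(chain_a["xyz"])
```

For a trajectory, the same selection could be computed once and reused for every snapshot, since only the coordinates change between frames.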

Generally speaking, the usability of a data structure depends on the APIs we offer on top of it rather than directly on the data structure itself.

jgreener64 commented 7 years ago

Yes, the advantages make sense. I would be happy to look at code etc. For reference, I think some of the R packages, primarily Bio3D, use something like this data structure.

One thing that should be thought about early is whether the data structure supports multiple models (e.g. NMR structures) and how it would do this. In principle the models are identical apart from the coordinates.
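One possible way to represent this, sketched in Python with hypothetical names (`Topology`, `Model`, `Ensemble`), is to store the atom and residue annotations once and let each model carry only its own coordinates. This is just one option for discussion, not something the proposal specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Everything that is identical across models."""
    atom_names: tuple
    residue_names: tuple


@dataclass
class Model:
    topology: Topology  # shared reference, not a copy
    coords: list        # one (x, y, z) tuple per atom, specific to this model


class Ensemble:
    """A set of models (e.g. from an NMR entry) over a single topology."""

    def __init__(self, topology):
        self.topology = topology
        self.models = []

    def add_model(self, coords):
        if len(coords) != len(self.topology.atom_names):
            raise ValueError("coordinate count does not match the topology")
        self.models.append(Model(self.topology, list(coords)))


topology = Topology(atom_names=("N", "CA"), residue_names=("GLY", "GLY"))
ensemble = Ensemble(topology)
ensemble.add_model([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)])
ensemble.add_model([(0.1, 0.0, 0.0), (1.4, 0.2, 0.0)])
```

With this arrangement, memory grows by one coordinate array per model, and a selection expressed against the topology applies to every model unchanged.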

bicycle1885 commented 7 years ago

Thank you. I'll take a look at Bio3D.

I'd like to work on the details next week or the week after, and then create a new package (possibly BioStructures.jl?) under the BioJulia project.

kescobo commented 7 years ago

+1 to BioStructures.jl - not that I'm likely to use this functionality ever 😆

jgreener64 commented 4 years ago

There is some great discussion on here but any further points should probably be made at BioStructures.jl.

TransGirlCodes commented 4 years ago

Thanks @jgreener64!

BioJulia / Bio.jl

New data structure for biomolecular structures #475