bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License

PDBx Parser #297

Open TimothyStiles opened 1 year ago

TimothyStiles commented 1 year ago

We're getting structural, but first we need a parser for PDBx so we can scrape and parse all of the Protein Data Bank.

I've seen a couple of Go parsers for PDBx but am unsure of their quality. In the end, we need to be able to parse a lot of these files:

https://www.wwpdb.org/deposition/preparing-pdbx-mmcif-files

rkrishnasanka commented 1 year ago

Here's the actual file format - http://www.wwpdb.org/documentation/file-format

I might be interested in taking this on in a month or so. Wouldn't mind outlining what needs to get done first.

One of the underlying formats is STAR - https://pubs.acs.org/doi/10.1021/ci00019a005

TimothyStiles commented 1 year ago

https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/beginner%E2%80%99s-guide-to-pdb-structures-and-the-pdbx-mmcif-format

carreter commented 1 year ago

@rkrishnasanka Pinged you over on the Discord about this, pinging you here as well - I'm thinking of picking this up and was wondering if you'd made any progress or would like to collaborate!

carreter commented 1 year ago

What actually is PDBx/mmCIF, anyway?

Did a bit more research on this, and it seems like the underlying syntax for PDBx/mmCIF is CIF v1.1, which is a proper subset of STAR and a glorified way of storing key:value pairs.
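To make the "key:value pairs" description concrete, here is a minimal Go sketch that reads only the simplest CIF construct: a data name followed by a single value on the same line. The function name `parseSimpleItems` and the sample input are made up for illustration. This is not poly's parser, and it deliberately skips `loop_` tables, multi-line text fields, and trailing comments, all of which a real CIF v1.1 parser would have to handle.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSimpleItems (hypothetical helper) collects "_name value" lines into a map.
// Lines that are not a data name followed by a value on the same line are
// skipped: data_ headers, comments, loop_ headers and rows, bare data names.
func parseSimpleItems(src string) map[string]string {
	items := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(src))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, "_") {
			continue // not a data item line
		}
		sep := strings.IndexAny(line, " \t")
		if sep < 0 {
			continue // data name without a value on this line (e.g. inside loop_)
		}
		name := line[:sep]
		value := strings.TrimSpace(line[sep:])
		// Strip one pair of matching single or double quotes, if present.
		if len(value) >= 2 && (value[0] == '\'' || value[0] == '"') && value[len(value)-1] == value[0] {
			value = value[1 : len(value)-1]
		}
		items[name] = value
	}
	return items
}

func main() {
	// Illustrative input only; not taken from a real PDB entry.
	src := `data_demo
# a comment line
_entry.id demo
_cell.length_a 61.3
_struct.title 'Example structure'
`
	items := parseSimpleItems(src)
	fmt.Println(items["_entry.id"])
	fmt.Println(items["_cell.length_a"])
	fmt.Println(items["_struct.title"])
}
```

Every value comes back as a string here. Deciding that `_cell.length_a` should be a number is exactly the job the dictionary layer takes on.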

On top of this syntax sit the Dictionary Definition Languages (DDLs), which allow for the description of "dictionaries" that grant domain-specific meaning (+ validation) to the data items stored in a CIF file. Seems like DDL is a self-validating format, which is pretty neat! There are two competing DDL versions currently used to define the PDBx/mmCIF dictionary, DDLm and DDL2. Seems like DDLm is a superset of DDL2, so it's probably worth targeting DDLm in our efforts.

So, in summary: PDBx/mmCIF's syntax is defined by CIF v1.1, and its semantics are defined by DDLm, whose own syntax is again CIF v1.1 and whose semantics are again defined by DDLm (yay recursion!).
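A rough Go sketch of that syntax/semantics split: the parsed items are plain strings, and a separate dictionary says what each item is allowed to be. Everything here (`itemDef`, `dictionary`, `validate`, the three value types) is a hypothetical simplification. A real DDL dictionary is itself a CIF file with many more attributes per item, and CIF placeholders such as `?` and `.` are not handled.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// itemDef is a hypothetical, heavily reduced stand-in for a dictionary entry.
type itemDef struct {
	valueType string // "int", "float", or "text" in this sketch
	mandatory bool
}

// dictionary maps a data name to its definition.
type dictionary map[string]itemDef

// validate checks parsed items against the dictionary and returns a sorted
// list of human-readable problems (empty when everything checks out).
func validate(items map[string]string, dict dictionary) []string {
	var problems []string
	for name, value := range items {
		def, known := dict[name]
		if !known {
			problems = append(problems, name+": not defined in dictionary")
			continue
		}
		switch def.valueType {
		case "int":
			if _, err := strconv.Atoi(value); err != nil {
				problems = append(problems, fmt.Sprintf("%s: expected an int, got %q", name, value))
			}
		case "float":
			if _, err := strconv.ParseFloat(value, 64); err != nil {
				problems = append(problems, fmt.Sprintf("%s: expected a float, got %q", name, value))
			}
		}
	}
	for name, def := range dict {
		if _, present := items[name]; def.mandatory && !present {
			problems = append(problems, name+": mandatory item is missing")
		}
	}
	sort.Strings(problems) // map iteration order is random, so sort for stable output
	return problems
}

func main() {
	dict := dictionary{
		"_entry.id":      {valueType: "text", mandatory: true},
		"_cell.length_a": {valueType: "float", mandatory: true},
	}
	items := map[string]string{"_cell.length_a": "abc"}
	for _, p := range validate(items, dict) {
		fmt.Println(p)
	}
}
```

The same `validate` function could in principle be pointed at a dictionary file's own items, which is where the recursion would show up in practice.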

Action Items

It seems to me that the next two tasks are clear:

Where to from there?

Based on my understanding, it would then be possible to write code generation tools that take a DDLm dictionary and generate the proper Go structs to represent the data. The alternative would be to manually create Go structs based on the current PDBx/mmCIF dictionary, which seems like a slog that would be prone to error. I have no idea how to go about writing code generation tools though, so I will absolutely need help on this!

See also

See Westbrook et al. 2022 for a nice overview of the current state of the PDBx/mmCIF ecosystem, as well as this tutorial on wwPDB for a brief but less rigorous intro to the PDBx/mmCIF format.

carreter commented 12 months ago

Update on this: CIF parser is nearing completion, will put a PR up soon™️.

TimothyStiles commented 9 months ago

@carreter where is this on your roadmap for after this semester? By chance I met @ethanholz at a conference the other week; he joined our Discord, wrote this in Zig, and had some pretty good insights.

https://github.com/ethanholz/sonic-pdb-parser

ethanholz commented 9 months ago

I am very happy to help in whatever way I can! I am a Go dev by trade but started exploring that parser on the side with Zig.

carreter commented 9 months ago

> @carreter where is this on your roadmap for after this semester? By chance met @ethanholz at a conference the other week who joined our Discord, wrote this in Zig, and had some pretty good insights.
>
> https://github.com/ethanholz/sonic-pdb-parser

Probably not super high priority.

Semester ends on 12/13, gonna relax for a bit then ramp up to full time paid work on poly during January! I'll be focusing on tools for Prof. Weiss's lab, but this will be something I'm still working on during that time.

Currently have a working mmCIF parser, but the DDLm part still needs to be written.

carreter commented 8 months ago

.take

github-actions[bot] commented 8 months ago

Thanks for taking this issue! Let us know if you have any questions!