Discngine / fpocket

fpocket is a very fast open source protein pocket detection algorithm based on Voronoi tessellation. The platform is suited to members of the scientific community who want to develop new scoring functions and extract pocket descriptors on a large scale. fpocket is distributed as free open source software. If you are interested in integrating fpocket in an industrial setting and require official support, please contact Discngine (www.discngine.com).
MIT License

Chain names longer than two characters #107

Closed nurulnad closed 5 months ago

nurulnad commented 1 year ago

Is your feature request related to a problem? Please describe. Hi, for mmCIF files, chain names (_atom_site.label_asym_id) can be longer than two characters. For example, PDB ID 4yzv has chains "AAA", "AAB"... I tried to run fpocket on files with longer chain names, and fpocket truncates the chain names in the result (which essentially renders it unusable).

Describe the solution you'd like Can fpocket work with chains longer than two characters?

pschmidtke commented 1 year ago

That would require a fix in read_mmcif.c and maybe (hopefully not) the molfileplugin. In theory it should not be too complicated to address. However, what is the expected output behaviour when writing output as PDB files? What should happen to the chain names?

nurulnad commented 1 year ago

Thanks for the reply. In the PDB, entries with chain names longer than 2 characters do not have PDB files at all, only mmCIF. I would say it is safe to truncate a three-letter chain code when you output as PDB files (if you want to output as PDB for those particular entries at all).
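The truncation suggested above could be sketched as follows. This is a minimal illustration, not code from fpocket: the helper `chain_for_pdb` and the constant `PDB_CHAIN_WIDTH` are hypothetical names, and the width of 2 is an assumption taken from the two-character limit discussed in this issue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical width of the chain field in a PDB writer (assumption, not from fpocket). */
#define PDB_CHAIN_WIDTH 2

/*
 * Copy at most `width` characters of an mmCIF chain name into `dst`.
 * `dst` must provide room for width + 1 bytes.
 * Returns 1 if the name had to be truncated, 0 otherwise.
 */
static int chain_for_pdb(const char *chain, char *dst, size_t width)
{
    assert(chain != NULL && dst != NULL);

    size_t len = strlen(chain);
    size_t n = len < width ? len : width;   /* number of characters kept */

    memcpy(dst, chain, n);
    dst[n] = '\0';
    return len > width;
}
```

Note that "AAA" and "AAB" both truncate to "AA", so distinct chains can collide in the PDB output. The return value lets a caller emit a warning when that happens.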

pschmidtke commented 1 year ago

Let me see when I can squeeze that in and report back to you. In the meantime, if you want to have a look, it'll mostly happen in here: https://github.com/Discngine/fpocket/blob/master/src/read_mmcif.c. There are also implications for selecting chains to identify epitopes using fpocket, as this would now need support for more than 2 characters as well.

What is the current limit for chain names in mmCIF? Is there finally something flexible where one can use actual words, or is it 3 now and will be 4 in 5 years? 😇
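For the chain selection side, one way to avoid any fixed length is to compare whole chain names rather than single characters. The sketch below assumes a comma-separated selection string. Both that format and the function name `chain_is_selected` are hypothetical and not taken from fpocket.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if `chain` matches one entry of the comma-separated
 * list `selection` exactly, 0 otherwise. Names of any length work,
 * and "AA" does not match an entry "AAA".
 */
static int chain_is_selected(const char *selection, const char *chain)
{
    assert(selection != NULL && chain != NULL);

    size_t want = strlen(chain);
    const char *p = selection;

    while (*p != '\0') {
        size_t tok = strcspn(p, ",");       /* length of the current entry */
        if (tok == want && strncmp(p, chain, want) == 0)
            return 1;
        p += tok;
        if (*p == ',')
            p++;                            /* skip the separator */
    }
    return 0;
}
```

With this approach, `chain_is_selected("AAA,AAB", "AAB")` matches while `chain_is_selected("AAA,AAB", "AA")` does not, so prefixes of longer names are not selected by accident.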

nurulnad commented 1 year ago

Thank you! mmCIF tries to be very flexible and there's no limit. I can see in our database (PDBe internal) that it's set to 3, but BMRB has set it to char(12)! I mean, entry_id will be lengthened to 8 characters in the near future...

xvlaurent commented 8 months ago

There is also an incoming change in RCSB structures, to be released from December/January: they will use 5-character codes for new ligands/residues.

https://www.rcsb.org/news/65007774d78e004e766a969d

There are also 12-character IDs incoming, but I don't think there is any impact.

pschmidtke commented 6 months ago

Tests to run:

pschmidtke commented 5 months ago

@nurulnad everything should be available in current master.