Create selectors for 'altloc' and the other various AG fields

lgsmith commented 1 year ago

Is your feature request related to a problem? Please describe

I often run into situations where I need to break up an AG into alt-loc sets, and one can't select on these. There may be other fields from PDBs or other source files that also don't have a selector in the selection langauge (for example, does B-factor?). It seemed natural to assume that if loos provided a field as metadata, then I could select on that metadata, but that's not how it works at present.

Describe the solution you'd like

I want an expansion to the selection language to include all the fields.

Describe alternatives you've considered

I currently write for loops over atoms within an AG and write conditional tests to filter things I need from fields the selection language doesn't index (specifically, for me, altloc).

Additional context

Nothing other than to say I'm not sure how many other metadata fields there are that are missing a selector, but they should probably all get checked.

agrossfield commented 1 year ago

This probably isn't happening any time soon. The selection parsing code is very brittle and hard to change, for reasons @tromo could explain better, and loos is really intended to focus on simulations rather than general structure analysis.

For what it's worth, though, the performance should be only trivially different for a manual loop vs. selection, since that's what selection does. If the loop really needs to be in C++, there are code examples of how to do this (look in KernelActions.hpp)

lgsmith commented 1 year ago

It's not performance that's the issue but grace. Writing a selection loop to select on a property of common loos metadata isn't a natural way to think about an analysis. It feels janky to have to do this, and it's a common problem for comparing simulations to room temp. crystal structures, where altlocs (and therefore AGs that have too many atoms in them by default) are the norm, so it's not a super obscure use case. Chapin is likely running into the same issue when writing analysis code for the OpenForceField benchmark, where structural comparisons to room temp. structures are prioritized. Also, I don't think it's the only field that can't be accessed from the selection language.

If it's so hard to maintain the selection language, should we change how it is implemented? It's a key feature--it has to keep pace with the life and times of the library.

On Wed, Apr 12, 2023 at 11:36 AM Grossfield Lab @.***> wrote:

This probably isn't happening any time soon. The selection parsing code is very brittle and hard to change, for reasons @tromo https://github.com/tromo could explain better, and loos is really intended to focus on simulations rather than general structure analysis.

For what it's worth, though, the performance should be only trivially different for a manual loop vs. selection, since that's what selection does. If the loop really needs to be in C++, there are code examples of how to do this (look in KernelActions.hpp)

— Reply to this email directly, view it on GitHub https://github.com/GrossfieldLab/loos/issues/105#issuecomment-1505488202, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZVYNBQW5CG66G2XVONBDXA3DX3ANCNFSM6AAAAAAW32YTAA . You are receiving this because you authored the thread.Message ID: @.***>

agrossfield commented 1 year ago

Replacing the selection language code is on the TODO list, but that too is a lot of work for very little impact, so it's far down the list.

agrossfield commented 1 year ago

Maybe I'm missing something, but your problem also strikes me as a failure to use the "curate once, use multiple times". If you're picking out the most populated side chain configuration, that's a do-once kind of thing -- write a script to do it, write it out, and thereafter use that.

As I think about it, your particular problem doesn't lend itself to the selection language anyway, since you're not just eg picking out atoms with occupancy above a certain threshold or b-factor below a threshold (and recall by design the selection language operates 1 atoms at a time).

tromo commented 1 year ago

I agree completely...I've been itching to replace the lex/yacc code with something like ANTLR, but it'll probably take a sabbatical to actually get it done.

lgsmith commented 1 year ago

If a user is using a python script to compare say an RMSD to a crystal structure that has altlocs in the selection sub-structure, they'll get a length error for the atomic group. This might take more naive users longer to diagnose than it did me because they'll need to realize that altloc atoms, by virtue of not being selectable, are indistinguishable from the atoms they are trying to select using the selection string, and therefore will get grouped with the AG they make to do the RMSD calculation regardless of how they troubleshoot the selection string.

What if we update the length mismatch error message with a phrase that says something like 'make sure you checked your reference structure for altloc atoms' or similar. Alternatively, when a PDB is read that has altlocs in it a warning could be printed that says something like 'You are using a structure with atoms in alternative locations (Atom.altLoc() to access). Selections comparing atoms in this group to analogous groups from simulations are unlikely to have the same number of atoms in them by default, even if you believe them to be chemically identical.'

GrossfieldLab / loos