docking-org / pydock3

Python package wrapping the DOCK Fortran program and providing several tools built on top of it.

DockOpt treats protomers/tautomers as separate unique molecules when calculating enrichment #30

Open svigneron opened 1 year ago

svigneron commented 1 year ago

The different protomers and tautomers of the same molecule get built as separate db2 files with the same ZINC ID but different numbers after the decimal points (i.e., ZINC00000000aBcD.0.0 vs. ZINC00000000aBcD.1.0). Each of these is docked and scored on its own; however, only the best-scoring protomer/tautomer should be considered when calculating enrichment. DockOpt currently treats every separate db2 file as a unique 'active' molecule, which alters the calculated enrichment and allows for situations where a poorly scoring protomer brings down the enrichment score even though an alternative protomer scores well compared to the decoy compounds.
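To make the effect concrete, here is a minimal sketch with made-up scores. It uses a plain pairwise ROC AUC as a stand-in for whatever enrichment metric DockOpt actually computes, and it assumes that lower (more negative) scores are better. The function name and all numbers are hypothetical.

```python
def roc_auc(active_scores, decoy_scores):
    """Fraction of (active, decoy) pairs in which the active has the
    lower (better) score; ties count as half a win."""
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a < d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


# Hypothetical scores: one active molecule built as two protomers.
actives = {
    "ZINC00000000aBcD.0.0": -45.0,  # protomer that docks well
    "ZINC00000000aBcD.1.0": -12.0,  # protomer that docks poorly
}
decoy_scores = [-30.0, -25.0, -20.0, -15.0]

# Every db2 file counted as its own active (the behavior reported here).
per_file_auc = roc_auc(list(actives.values()), decoy_scores)

# Only the best-scoring protomer kept for the molecule.
best_only_auc = roc_auc([min(actives.values())], decoy_scores)

print(per_file_auc, best_only_auc)
```

With these numbers the poorly scoring protomer halves the AUC (0.5 versus 1.0), even though the molecule's best protomer outranks every decoy.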

ianscottknight commented 1 year ago

@jir322 Please comment on this if you have any remarks.

jir322 commented 1 year ago

It would be best to unify all molecules with the same ZINC ID and count them as a single molecule, retaining only the best score for one representative of each ZINC ID.
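A rough sketch of that unification, under two assumptions that are not established in this thread: the molecule ID is everything before the first period in the name, and a lower score is better. The function name and the sample rows are hypothetical and are not part of pydock3.

```python
def best_score_per_molecule(rows):
    """Collapse (name, score) rows to one best (lowest) score per base ID.

    Assumption: the base ID is the part of the name before the first period.
    """
    best = {}
    for name, score in rows:
        base_id = name.split(".", 1)[0]
        if base_id not in best or score < best[base_id]:
            best[base_id] = score
    return best


# Hypothetical (name, score) rows, e.g. as read from an OUTDOCK file.
rows = [
    ("ZINC00000000aBcD.0.0", -45.0),
    ("ZINC00000000aBcD.1.0", -12.0),
    ("ZINC00000000eFgH.0.0", -30.5),
]

collapsed = best_score_per_molecule(rows)
print(collapsed)
```

Names without a period pass through unchanged, so molecules that do not follow the suffix convention are still counted once each.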


ianscottknight commented 1 year ago

@jir322 From DockOpt's perspective, there is no such thing as a "ZINC ID". There is only the id_num column in the OUTDOCK file, which corresponds to the zincname field encoded in the molecule's .db2 file (https://wiki.docking.org/index.php?title=DB2_File_Format).

(Note that zincname and id_num are both misnomers for their data types, and are partly responsible for the confusion here. E.g., it is possible for built molecules to come from somewhere other than ZINC, such as the actives in the DUDE-Z dataset, which come from RCSB PDB.)

The real problem here is that there is no general ID for molecules in the DB2 file format. One possible solution is to just use the zincname field in the .db2 file as an actual molecule ID, since DockOpt currently treats the id_num column of OUTDOCK as a molecule ID, but doing so would almost certainly only create confusion in the long run.

Another solution is to update the .db2 file format to account for this ambiguity by adding a molecule_id field (and rectifying the zincname misnomer).

Yet another solution is to adopt a naming convention for the zincname field entries of the same molecule, which would allow DockOpt to work out which entries belong to the same molecule. I would suggest a regex. E.g., ^.*\.\d$ would match any string consisting of a ZINC code followed by a period and a single digit, where the digit would identify the protomer / tautomer.
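As an illustration only, the snippet below checks what the suggested pattern accepts using Python's re module, and then tries a hypothetical variant with named groups that separates a base name from one or more numeric suffixes (the names quoted at the top of this issue carry two, e.g. ".0.0"). The variant pattern and the split_name helper are assumptions for discussion, not an agreed convention.

```python
import re

# Pattern as suggested above: any string ending in a period and one digit.
SUGGESTED = re.compile(r"^.*\.\d$")

# Hypothetical variant: a base name without periods, followed by one or
# more ".<digits>" suffixes identifying the protomer / tautomer.
VARIANT = re.compile(r"^(?P<base>[^.]+)(?P<suffix>(?:\.\d+)+)$")


def split_name(name):
    """Return (base, suffix); names that do not follow the convention
    are returned whole with an empty suffix."""
    m = VARIANT.match(name)
    if m is None:
        return name, ""
    return m.group("base"), m.group("suffix")


print(bool(SUGGESTED.match("ZINC00000000aBcD.1")))   # single-digit suffix
print(bool(SUGGESTED.match("ZINC00000000aBcD.12")))  # two-digit suffix
print(split_name("ZINC00000000aBcD.1.0"))
print(split_name("1abc_ligand"))
```

Note that the suggested pattern only tests whether a name ends in a period and a single digit. Extracting the shared part of the name needs a capture group, and the convention would have to say how multi-digit or multi-part suffixes are handled.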

jir322 commented 1 year ago

Let's discuss this. There is no practical way to retroactively update all of the ZINC-22 files. We can fix it in ZINC-25, but that's not for a while longer.

John Irwin UCSF Pharmaceutical Chemistry http://irwinlab.compbio.ucsf.edu
