PatWalters / benchmark_map4

Benchmarking the MAP4 fingerprint in regression models
MIT License

many molecules in the dataset do not pass molecular standardization #7

Open UnixJunkie opened 1 month ago

UnixJunkie commented 1 month ago
target  failure_rate
acet    99.6%
erb1    99.8%
estr    84.5%
lck_    43.9%

UnixJunkie commented 1 month ago

I used this tool: https://github.com/UnixJunkie/molenc/blob/master/bin/molenc_std.py with the -p option (preserve stereochemistry).

One problem is something of a performance bug in RDKit: finding the canonical tautomer fails for many molecules (at most 1000 tautomers are enumerated per molecule, and this is quite slow).
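
For anyone who wants to reproduce the tautomer issue outside of molenc_std.py, here is a minimal sketch. It is not taken from molenc_std.py; it assumes RDKit's `rdMolStandardize.TautomerEnumerator`. The function name `canonical_tautomer_smiles` and the cap values of 100 are hypothetical choices for illustration only. The idea is to lower the enumeration caps and to report whether enumeration actually completed, so that molecules hitting the cap can be counted separately.

```python
def canonical_tautomer_smiles(smiles, max_tautomers=100, max_transforms=100):
    """Return (canonical tautomer SMILES, completed flag).

    Returns (None, False) if the SMILES cannot be parsed.
    Hypothetical helper; the default caps of 100 are illustrative only.
    """
    # Imported inside the function so this sketch can be loaded
    # even on a machine where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None, False

    enumerator = rdMolStandardize.TautomerEnumerator()
    # Lower caps bound the time spent on each molecule.
    enumerator.SetMaxTautomers(max_tautomers)
    enumerator.SetMaxTransforms(max_transforms)

    # Enumerate once, then pick the canonical form from that result.
    result = enumerator.Enumerate(mol)
    completed = (
        result.status == rdMolStandardize.TautomerEnumeratorStatus.Completed
    )
    canonical = enumerator.PickCanonical(result)
    return Chem.MolToSmiles(canonical), completed
```

When `completed` is false, the returned SMILES is only the best-scoring tautomer among those enumerated before the cap was reached, so whether to count such a molecule as a standardization failure is a judgment call.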