assign zenodo DOI - Githubissues

orbeckst commented 2 years ago

Once the licensing #1 is sorted out, I recommend making a release and associating it with a Zenodo package and DOI so that it becomes easier to refer to the benchmark.

cbouy commented 2 years ago

Agreed!

But first I'd like to actually run the benchmark before making a release, since it's supposed to create several report files that can then be easily read by others once uploaded to this repo. Unfortunately I don't have access to my workstation at the university anymore, and my laptop isn't going to cut it.

Do you think one of the coredevs could run the benchmark and then create a PR to upload the report files? It should take around 48h to run the whole thing on a workstation with 8 cores/16 processors

orbeckst commented 2 years ago

Yes, sure, I can run it at one of our workstations. Various 16 cores and one 20 cores.

What needs to be done?

cbouy commented 2 years ago

Great, thanks!

First install conda (and optionally install mamba to speed up the installation of dependencies)

Then clone the repo somewhere appropriate (temporary gzipped files will be generated in child directories):

git clone https://github.com/MDAnalysis/RDKitConverter-benchmark.git
cd RDKitConverter-benchmark

Then install the dependencies with make install CONDA=mamba (remove CONDA=mamba if you didn't install mamba). It will create a separate conda env called rdkitconverter which is automatically activated when running the make commands.

Finally run all the steps by simply typing make. It should automatically use all the logical processors on the machine, but you can manually set the number of processes with make N_WORKERS=XXX.

This last make command will first fetch ChEMBL 30 compounds, then preprocess the dataset, then run the actual benchmark, and finally generate the report, with progress bars at each step.
Once the last step has been run, you should be able to directly push all the files that are in the results/ directory (or make a PR).

I reckon all the various steps can be done in around 48h with 16 workers and decent broadband speed but I would put 72h of walltime just in case.

cbouy commented 2 years ago

@orbeckst did you manage to find some time to run the benchmark?

orbeckst commented 2 years ago

Thanks for the reminder, I am now running it on 18 cores.

orbeckst commented 2 years ago

Processing molecules:  14%|█▍        | 303728/2136187 [02:54<16:46, 1820.20it/s]

orbeckst commented 2 years ago

Started the benchmark:

Accuracy: 100.00% | Progress:   0%|          | 1332/1942004 [01:08<30:44:05, 17.54it/s]

orbeckst commented 2 years ago

Accuracy: 99.67% | Progress:  39%|███▉      | 756386/1942004 [11:43:04<14:07:48, 23.31it/s]

orbeckst commented 2 years ago

Unfortunately, I got an error:

Accuracy: 99.32% | progress:  79%|███████▉  | 1538768/1942004 [23:17:01<6:06:05, 18.36it/s]
Traceback (most recent call last):
  File "~/Projects/MDA/RDKitConverter/RDKitConverter-benchmark/scripts/benchmark.py", line 43, in <module>
    for i, result in enumerate(pool.imap_unordered(validate_entry, fi)):
  File "~/.conda/envs/rdkitconverter/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7fc6fc9a0e80>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed")'
make: *** [Makefile:58: data/chembl_failed.smi] Error 1

What should I do, @cbouy ?

cbouy commented 2 years ago

Oh no :( Will investigate later today as to why this happened (the error trace isn't very helpful unfortunately) and I'll get back to you

cbouy commented 2 years ago

@orbeckst I've uploaded the results for the incomplete run here, I'll try to debug in the coming days or so.

The next steps for me are:

generating a file containing the molecules that haven't been processed yet (will make debugging easier)
investigate the bug
rerun the benchmark on the unprocessed molecules
concatenate results

Edit: quickly skimming through the molecules that failed the benchmark, looks like the main hurdles for the converter are:

N=[N+]=[N-] (2,315 matches)
[N+]#[C-] (615 matches)
[B] (285 matches)
[n+]1ccc2cc[nH]c2c1 (207 matches)

But I'd have to do a more thorough analysis

orbeckst commented 2 years ago

Is the error in https://github.com/MDAnalysis/RDKitConverter-benchmark/issues/2#issuecomment-1127065146 related to the molecules themselves or something else?

cbouy commented 2 years ago

Haven't looked at it yet but since the origin is an ArgumentError from Boost my guess is that during the inferring, one of the molecules fails silently at some point and becomes a None object, which is then passed to an RDKit function, which then triggers the boost argument error because it expected a Mol and not a None type.

Don't know why it's not caught by my try: ... catch Exception as e block though

cbouy commented 2 years ago

Found the issue: one of the molecules which is initially read from the ChEMBL SD file and converted to SMILES during the preprocessing step, actually produces an invalid SMILES because of incorrect aromaticity perception: https://github.com/rdkit/rdkit/issues/5078

For now I'll just remove this molecule from the benchmark, and add some sanity checks in the preprocessing script so that this doesn't happen again in future benchmarks.

Seems to be working correctly now:

Accuracy: 99.94% | Progress:   7%|▋         | 27222/404005 [23:48<4:23:56, 23.79it/s]

cbouy commented 2 years ago

Aaaaand it's done! Final accuracy: 99.14% not bad!!

I'll link the repo to zenodo, make a release and add the DOI on the README

Edit: done

MDAnalysis / RDKitConverter-benchmark

assign zenodo DOI #2