coleygroup / molpal

active learning for accelerated high-throughput virtual screening
MIT License

Bug in explorer #15

Closed · rebirthjin closed this issue 2 years ago

rebirthjin commented 2 years ago

Hello, I am trying to run MolPAL on my own data. While it was running, I ran into two problems.

The first error is below:

Exception raised! Intemediate state saved to "molpal_stock/chkpts/iter_25_2021-12-01_22-26-33/state.json"

Traceback (most recent call last):
  File "/home/njgoo/Data1/program/molpal/run.py", line 73, in <module>
    main()
  File "/home/njgoo/Data1/program/molpal/run.py", line 57, in main
    explorer.run()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 326, in run
    self.explore_batch()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 422, in explore_batch
    return sum(valid_scores)/len(valid_scores)
ZeroDivisionError: division by zero

I think valid_scores could be empty, so len(valid_scores) is zero.

The second error is below:

Finished exploring!
Exception raised! Intemediate state saved to "molpal_stock/chkpts/iter_51_2021-12-02_10-16-51/state.json"
Traceback (most recent call last):
  File "run.py", line 73, in <module>
    main()
  File "run.py", line 57, in main
    explorer.run()
  File "/home/njgoo/Data1/program/molpal/molpal/explorer.py", line 329, in run
    print(f'FINAL TOP-{self.k} AVE: {self.top_k_avg:0.3f} | '
TypeError: unsupported format string passed to NoneType.__format__

I wonder whether my job is going through the wrong process.

What should I check in the intermediate data to debug this?

Thank you

davidegraff commented 2 years ago

The first error looks like all the scores from that round of optimization were invalid, causing molpal to calculate 0/0 and raise that error. That's an edge case we can look at covering in the code, but it's generally a cause for concern when every objective calculation fails. I'm undecided on how we should handle this in the code.
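Something along these lines would cover the edge case (just a sketch, the helper name is hypothetical; the real code lives in explorer.py), though it only papers over the underlying problem of every score being invalid:

```python
from typing import List, Optional

def safe_mean(valid_scores: List[float]) -> Optional[float]:
    """Hypothetical guard for the average in explore_batch(): return None
    instead of raising ZeroDivisionError when *every* score was invalid."""
    if not valid_scores:
        return None  # the caller must then handle the "no valid scores" case
    return sum(valid_scores) / len(valid_scores)
```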

The second error looks to be a result of self.top_k_avg being None at the end of optimization. Again, this is likely due to there being too few valid scores from which to calculate a top-k average. That really shouldn't happen in a reasonable run, so I'm curious why so many of your objective evaluations are failing.

rebirthjin commented 2 years ago

@davidegraff Thanks for your advice.

For the lookup process, the scores in my CSV file were positive values recalculated from the docking scores, and I removed "--minimize" from the objective options so the optimization would maximize them. Could that parameter setting cause the objective calculations to fail?

I am now rerunning the process with the scores changed back to negative values and the "--minimize" option added; it should complete without error. I will let you know.

Have a good day!

davidegraff commented 2 years ago

Are you sure that your lookup objective is being constructed properly?

rebirthjin commented 2 years ago

@davidegraff What should I check to confirm the lookup objective is constructed properly? Here is a sample of my lookup CSV:


smiles,score
C[C@@]1(c2ccccc2)OCCO[C@H]1C(=O)O,-3.961000
Cc1ncn(C[C@H]2CC(C)(C)CO2)c1C,-4.435000
CC(=O)N1C[C@H]2CNC[C@@]2(C(=O)N(C)Cc2ccoc2)C1,-5.111000
O[C@H]1C[C@@H]2CCCN(C1)C2,-4.209000
CNC(=O)c1cccc(Nc2nc(O)nc(O)c2C#N)c1,-5.455000
Cc1nc(CNc2ccc(F)c(N3CCCS3(=O)=O)c2)cs1,-6.235000
Cc1c(/N=N/c2cccc(C)c2C)c(-c2ccccc2)nn1C(=S)S,-6.337000
OCc1cc(-c2ccc(Cl)c(Cl)c2)ccn1,-3.907000
Cc1noc(C)c1COC(=O)c1ccc(Cl)cc1N1CCCC1=O,-5.407000
O=C(O)c1ccccc1S(=O)(=O)n1ccc(=O)[nH]c1=O,-5.967000
CC(C)CC(=O)NC[C@@]12CNC[C@@H]1COC2,-4.212000

After I changed the docking scores in the CSV back to all negative values and ran with "--minimize", the process finished completely. But I got the most positive values at the top of all_explored_final.csv.

Is something wrong with my parameters for the objective option?

MolPAL will be run with the following arguments:
  batch_sizes: [0.01]
  budget: 1.0
  cache: False
  checkpoint_file: None
  chkpt_freq: 0
  cluster: False
  conf_method: mve
  config: njkoo_config.ini
  cxsmiles: False
  ddp: False
  delimiter: ,
  delta: 0.1
  epsilon: 0.0
  final_lr: 0.0001
  fingerprint: pair
  fps: /home/njgoo/Data1/program/molpal/libraries/ZINC20_Stock.h5
  init_lr: 0.0001
  init_size: 0.01
  invalid_idxs: []
  k: 0.0005
  length: 2048
  libraries: ['/home/njgoo/Data1/program/molpal/libraries/ZINC20_Stock.csv.gz']
  max_iters: 50
  max_lr: 0.001
  metric: random
  minimize: True
  model: mpn
  model_seed: None
  ncpu: 20
  objective: lookup
  objective_config: njkoo_lookup.ini
  output_dir: molpal_stock
  pool: eager
  precision: 32
  previous_scores: None
  radius: 2
  retrain_from_scratch: True
  scores_csvs: None
  seed: None
  smiles_col: 0
  test_batch_size: None
  title_line: True
  verbose: 0
  window_size: 10
  write_final: True
  write_intermediate: True

davidegraff commented 2 years ago

I would just add a print statement to see what sort of values you're getting out of objective.calc(...). If all of the values failed, then there's an issue with how you're constructing your MolPAL run.
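Something like this, for example (a sketch; it assumes the objective built from your config is in scope and that calc returns a dict mapping each SMILES to a score or None on failure):

```python
# Sketch: probe the objective directly with a few SMILES from your library.
# Assumes `objective` is the lookup objective built from njkoo_lookup.ini and
# that calc(...) returns a dict mapping each SMILES to a float, or to None on
# failure.
smis = [
    "C[C@@]1(c2ccccc2)OCCO[C@H]1C(=O)O",
    "Cc1ncn(C[C@H]2CC(C)(C)CO2)c1C",
]
scores = objective.calc(smis)
print(scores)
# If every value is None, the lookup never matches your molecules, i.e., the
# run is misconfigured (wrong column, delimiter, or sign convention).
```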

albertma-evotec commented 2 years ago

I also noticed that my output files are filled with positive scores while my lookup file has negative scores (more negative = better compound). I think the sign just got swapped during processing. [screenshot omitted] It explored compounds with progressively more negative scores, so I think it was doing what it was supposed to.

davidegraff commented 2 years ago

The output files always use positive scores, regardless of the input lookup file.

rebirthjin commented 2 years ago

@davidegraff Should we convert the positive scores in the output back into negative scores? Since docking scores are total energies, more negative values indicate better compounds.

Also, in which file should I add the print statement? I can't find the objective.calc() function. Thanks for the quick response!

davidegraff commented 2 years ago

Yes. MolPAL is framed as a maximization problem, so the output reflects that: the most positive output is the best.

rebirthjin commented 2 years ago

I wonder what the --minimize option means, then. Before your comment, I understood that a run with the 'minimize' option would yield more negative scores and a run without it would yield more positive scores.

davidegraff commented 2 years ago

In docking, a more negative score is better, so you want to --minimize it. Unless of course you were trying to find the worst possible binder for your target of interest, in which case you would want to maximize it (the default assumption). To perform a minimization, we multiply objective values by -1 under the hood so that the rest of the program sees a maximization. You see the result of this multiplication in the output.
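Conceptually, the sign handling works like this (a plain-Python sketch, not the actual molpal code):

```python
# Conceptual sketch of the sign convention (not the actual molpal code):
# with --minimize, scores are negated internally so the rest of the program
# can always maximize; the negated values are what end up in the output.
def internal_score(true_score: float, minimize: bool) -> float:
    return -true_score if minimize else true_score

print(internal_score(-6.337, minimize=True))   # 6.337: best binder ranks highest
print(internal_score(-6.337, minimize=False))  # -6.337: taken as-is
```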

rebirthjin commented 2 years ago

Thanks for the kind explanation! That matches my understanding. However, I'm confused about how to interpret the output in all_explored_final.csv.

I got positive values from the run with the --minimize option, but negative values from the run without it.

That contradicts your statement that "The output files always use positive scores, regardless of the input lookup file", and it is also the opposite of the result I expected.

Could you check the code that multiplies the objective values? For now I just convert the output values myself by multiplying by -1.

I also found different default values for minimize in the objective classes:

minimize: bool = False in base.py
minimize: bool = True in lookup.py

Have a good day!

davidegraff commented 2 years ago

I misspoke earlier. The values in the output are not always positive, but more positive values are always "better" in MolPAL's view. I.e., if you --minimize your objective, then the true objective values are recovered by multiplying the output by -1. If you maximize, then you may take the output scores as-is. The differing class defaults are overridden by the minimize value supplied in the arguments, which is False by default.
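So to recover the true docking scores from a --minimize run, just undo the negation (a sketch, assuming all_explored_final.csv has a smiles,score layout with a title line, like your lookup CSV):

```python
import csv

# Sketch: undo the internal negation after a --minimize run. Assumes
# all_explored_final.csv has a smiles,score layout with a title line,
# like the lookup CSV above.
with open("all_explored_final.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header
    for smiles, score in reader:
        true_score = -float(score)  # recover the true docking score
        print(smiles, true_score)
```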

rebirthjin commented 2 years ago

Thank you very much!