`ValueError` probably caused from some missing parallel results

muammar commented 3 months ago

I'm processing an SDF file that fails with the following error:

"""
Traceback (most recent call last):
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/joblib/_utils.py", line 72, in __call__
    return self.func(**kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/joblib/parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/datamol/utils/jobs.py", line 83, in _run
    return fn(args, **fn_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/medchem/structural/lilly_demerits/_demerits.py", line 324, in _score
    results["mol"] = mols
    ~~~~~~~^^^^^^^
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/pandas/core/frame.py", line 4311, in __setitem__
    self._set_item(key, value)
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/pandas/core/frame.py", line 4524, in _set_item
    value, refs = self._sanitize_column(value)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/pandas/core/frame.py", line 5266, in _sanitize_column
    com.require_length_match(value, self.index)
  File "/efs/home/muammar.elkhatib/miniconda3/envs/ml/lib/python3.11/site-packages/pandas/core/common.py", line 573, in require_length_match
    raise ValueError(
ValueError: Length of values (4666) does not match length of index (4658)
"""

Based on the traceback, it looks like some molecules do not return results when scored with the parallel backend: 4666 molecules go in, but the results frame has only 4658 rows, so 8 are missing. Because the input list and the results `pd.DataFrame` differ in length, pandas refuses the column assignment. I will try to find out which molecules are failing, but I think it would be good if the library at least caught the error and filled the missing results with `np.nan`. Do you have any suggestions?
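To make the suggestion concrete, here is a minimal sketch of the kind of NaN padding I have in mind. It is not the medchem implementation: it assumes the scorer output carries a column with the position of the input each row came from (`input_idx`, a hypothetical name), and `demerit_score` is just an illustrative result column.

```python
import pandas as pd


def attach_inputs(results: pd.DataFrame, inputs: list, id_col: str = "input_idx") -> pd.DataFrame:
    """Return one row per input, with NaN where the scorer returned nothing.

    Assumption: `results[id_col]` holds the position of the input that
    produced each row. The column name is hypothetical.
    """
    # reindex() adds a NaN-filled row for every position missing from results.
    aligned = results.set_index(id_col).reindex(range(len(inputs)))
    # Lengths now match, so this assignment no longer raises ValueError.
    aligned.insert(0, "mol", inputs)
    return aligned


# Five inputs, but the (made-up) scorer output lacks rows for inputs 1 and 3.
inputs = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"]
partial = pd.DataFrame({"input_idx": [0, 2, 4], "demerit_score": [10, 0, 35]})

aligned = attach_inputs(partial, inputs)
# Positions of the inputs that came back without a result.
missing = aligned[aligned["demerit_score"].isna()].index.tolist()
```

Something like `missing` would also tell me directly which molecules are being dropped, provided the results can be traced back to their inputs at all.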

Best,

maclandrol commented 3 months ago

Good catch. Can you check whether all your molecules parse correctly, without any issues? (Use the datamol `to_mol` function for that.)

Ideally, you should filter out all invalid molecules beforehand, because of how the Lilly code handles them.

muammar commented 3 months ago

> Good catch. Can you check whether all your molecules parse correctly, without any issues? (Use the datamol `to_mol` function for that.)
>
> Ideally, you should filter out all invalid molecules beforehand, because of how the Lilly code handles them.

They all parse without issues. I used this code:

Thanks for your fast reply 😄

maclandrol commented 3 months ago

`dm.to_mol` can return `None`. Can you check whether any of the molecules is `None`? It also helps to standardize the list of molecules so that the SMILES are canonical.
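A rough sketch of that check, in case it helps. The helper name `split_parsed` and the stand-in parser are hypothetical and only there so the snippet runs without RDKit; with datamol installed you would pass `dm.to_mol` as `parse`.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def split_parsed(
    inputs: Sequence[str],
    parse: Callable[[str], Optional[object]],
) -> Tuple[List[object], List[int]]:
    """Parse every input and return (valid_mols, failed_positions)."""
    valid, failed = [], []
    for i, item in enumerate(inputs):
        mol = parse(item)
        if mol is None:
            # Remember where the failure happened instead of silently dropping it.
            failed.append(i)
        else:
            valid.append(mol)
    return valid, failed


# Stand-in parser for illustration only: it rejects empty strings.
def toy_parse(s: str) -> Optional[str]:
    return s if s else None


valid, failed = split_parsed(["CCO", "", "CCN"], toy_parse)
```

If `failed` is empty for your SDF, invalid molecules are probably not the explanation and we need to look elsewhere.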

muammar commented 3 months ago

> `dm.to_mol` can return `None`. Can you check whether any of the molecules is `None`? It also helps to standardize the list of molecules so that the SMILES are canonical.

I passed the `Mol` objects to `rdMolStandardize.Cleanup` and built a pandas Series to count the NaNs. There are zero. Let me know if you need more information.

Thanks.

maclandrol commented 3 months ago

OK, this is indeed weird. If you are able to share your SDF so I can debug it, that would be nice. Otherwise, if you have an alternative SDF that shows the same error, that would be helpful.

muammar commented 3 months ago

> OK, this is indeed weird. If you are able to share your SDF so I can debug it, that would be nice. Otherwise, if you have an alternative SDF that shows the same error, that would be helpful.

Thank you for your fast responses. I will check if I can share the SDF file for you to debug. I don't have another SDF that could be used to reproduce this error.

maclandrol commented 3 months ago

@muammar, any updates to share on this?

muammar commented 3 months ago

> @muammar, any updates to share on this?

I had to use the Ruby implementation of the rules to keep the ball rolling. The compound library that triggers the problem is the Maybridge HitCreator. I'm unsure whether I can share the SDF, but they have a request form. Thank you for your fast responses, @maclandrol.

datamol-io / medchem

`ValueError` probably caused from some missing parallel results #23