DeepRank / deeprank2

An open-source deep learning framework for data mining of protein-protein interfaces or single-residue variants.
https://deeprank2.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

Bug: No error message when no value calculated for HSE #402

Open Max1461 opened 1 year ago

Max1461 commented 1 year ago

Describe the bug
When generating graphs from a sample set of PDBs containing micro-environments created from pMHC structures, and saving them to an HDF5 file, the following error occurred for me:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/max/anaconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/max/anaconda3/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/max/deeprank-core/deeprankcore/query.py", line 197, in _process_one_query
    graph.write_to_hdf5(output_path)
  File "/home/max/deeprank-core/deeprankcore/utils/graph.py", line 220, in write_to_hdf5
    node_features_group.create_dataset(
  File "/home/max/anaconda3/lib/python3.9/site-packages/h5py/_hl/group.py", line 161, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/home/max/anaconda3/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 88, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1663, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1687, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1747, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-6150c50ecc2f> in <module>
      4 feature_modules = [importlib.import_module('deeprankcore.features.' + name) for name in feature_names]
      5 # Generate graphs and save them in hdf5 files
----> 6 output_paths = queries.process(output_path, feature_modules = feature_modules)

~/deeprank-core/deeprankcore/query.py in process(self, prefix, feature_modules, cpu_count, combine_output, grid_settings, grid_map_method, grid_augmentation_count)
    271         with Pool(self.cpu_count) as pool:
    272             _log.info('Starting pooling...\n')
--> 273             pool.map(pool_function, self.queries)
    274 
    275         output_paths = glob(f"{prefix}-*.hdf5")

~/anaconda3/lib/python3.9/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    362         in a list that is returned.
    363         '''
--> 364         return self._map_async(func, iterable, mapstar, chunksize).get()
    365 
    366     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/lib/python3.9/multiprocessing/pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772 
    773     def _set(self, i, obj):

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

This was not a very clear error message for figuring out what was actually going wrong. After adding some print statements myself, it turned out that the failure originates in the following steps of the graph.py script:

# store node features
node_key_list = list(self._nodes.keys())
first_node_data = list(self._nodes.values())[0].features
node_feature_names = list(first_node_data.keys())
print(node_feature_names)
for node_feature_name in node_feature_names:
    print(node_feature_name)
    node_feature_data = [node.features[node_feature_name] for node in self._nodes.values()]
    #print(node_feature_data)
    node_features_group.create_dataset(node_feature_name, data=node_feature_data)

Because no HSE could be calculated for some of the PDB files, the corresponding feature value is None or empty and cannot be stored, causing a discrepancy in node features between the graphs and resulting in the error shown above.
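One way to surface this earlier would be a validation pass just before writing to HDF5 that names the offending feature. Below is a minimal sketch; `check_node_features` and the flat list-of-dicts shape are illustrative, not the actual deeprankcore API (the real Graph keeps features on Node objects):

```python
import numpy as np

def check_node_features(node_features: list, feature_names: list) -> None:
    """Raise a descriptive error if a feature is None, missing, or ragged.

    `node_features` is one features dict per node (hypothetical shape,
    for illustration only).
    """
    for name in feature_names:
        values = [features.get(name) for features in node_features]
        n_missing = sum(v is None for v in values)
        if n_missing:
            raise ValueError(
                f"feature '{name}' is None for {n_missing} node(s); "
                "it may not have been computed for this structure"
            )
        try:
            arr = np.asarray(values)
        except ValueError as exc:
            raise ValueError(
                f"feature '{name}' has inconsistent shapes across nodes"
            ) from exc
        if arr.dtype == object:
            raise ValueError(
                f"feature '{name}' has a non-numeric dtype and cannot be "
                "stored as an HDF5 dataset"
            )
```

With a check like this, the traceback would point at the feature name instead of the opaque `Object dtype dtype('O') has no native HDF5 equivalent`.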

It would be nice if, when the HSE feature (or any other feature) is not or cannot be calculated, a clear error message indicated which feature ran into the problem, so that the user can easily determine the cause without having to sift through the process themselves.

If you want to reproduce this error, here is an example of a PDB file that works: 1AKJ_1_ILE.txt And one that triggers the error: 1AKJ_2_LEU.txt

Running the second file with:

import os
import importlib
from deeprankcore.query import QueryCollection, ProteinProteinInterfaceResidueQuery

queries = QueryCollection()

# Append data points
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "1AKJ_2_LEU.pdb",
    chain_id1 = "A",
    chain_id2 = "C",
    targets = {
        "binary": 0
    }
))
output_path = os.path.join(output_directory, project_id)
# Set feature to be used by feature modules named in feature_names
feature_names = ['components', 'contact', 'exposure', 'surfacearea']
feature_modules = [importlib.import_module('deeprankcore.features.' + name) for name in feature_names]
# Generate graphs and save them in hdf5 files
output_paths = queries.process(output_path, feature_modules = feature_modules)

This should reproduce the error. The main issue is the lack of an error message from the exposure.py script: nothing indicated that the problem lay there.

EDITED by @DaniBodor to fix the code blocks.

gcroci2 commented 1 year ago

We could add something NaN-like in such cases, checking both node and edge features before storing them. We can add a warning message in the exposure.py script as well. @DaniBodor we'll discuss who will pick this up next week
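That pre-storage idea could be sketched as follows; `fill_missing` and the list-of-dicts shape are assumptions for illustration, not the merged fix:

```python
import logging

import numpy as np

_log = logging.getLogger(__name__)

def fill_missing(node_features: list) -> None:
    """Replace None feature values with np.nan in place, warning once per feature."""
    warned = set()
    for features in node_features:
        for name, value in features.items():
            if value is None:
                if name not in warned:
                    _log.warning("feature '%s' has missing values; storing NaN", name)
                    warned.add(name)
                features[name] = np.nan
```

NaN, unlike None, has a native HDF5 representation, so the write no longer fails; the warning tells the user which feature was affected.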

DaniBodor commented 1 year ago

Maybe we can look for a way to check for NaN/missing values systematically across all features after generating the graph, and output an error with missing features and/or an option to set such values to 0.

gcroci2 commented 1 year ago

> Maybe we can look for a way to check for NaN/missing values systematically across all features after generating the graph, and output an error with missing features and/or an option to set such values to 0.

There are several opinions about how to set NaN values, and it depends a lot on the feature (e.g. different values for different features), so I wouldn't enforce any default. I would say it is up to the user to decide how to fill them in. Integrating this into the code base while giving great flexibility about which value to fill in for each feature, without breaking anything and doing it properly, is not trivial at all. Also, we need to think about a way that doesn't increase the overhead too much, which is why I would do the check before writing the features to the HDF5 files.

We could also just add a NaN count to each histogram, or build a dict during graph generation and print at the end how many NaNs are present in each feature (something the user can access and notice). Together with this, we can improve the warnings in the feature modules for such cases.
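The NaN-counting dict could look something like this (a hypothetical helper, not part of deeprankcore):

```python
import math
from collections import defaultdict

def count_nans(node_features: list) -> dict:
    """Count NaN/None entries per feature across all nodes.

    `node_features` is one features dict per node (illustrative shape);
    values may be scalars or lists of scalars.
    """
    counts = defaultdict(int)
    for features in node_features:
        for name, value in features.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            counts[name] += sum(
                v is None or (isinstance(v, float) and math.isnan(v))
                for v in values
            )
    return dict(counts)
```

Printing this dict after graph generation would let users spot features with many missing values before training.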

DaniBodor commented 1 year ago

Good point about defaulting NaNs.

I still think it would be a good idea to have a default check for NaNs during graph creation (e.g. after each feature module is called) and before the HDF5 file is created, so that for future/custom feature modules, if it is not handled within the module, there is still a default error message that makes clear what the problem is and where it happened.
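A per-module check could be sketched like this; the `add_features(pdb_path, graph)` entry point and the `graph.nodes` attribute are assumed conventions for illustration, not guaranteed to match the deeprankcore internals:

```python
import logging

_log = logging.getLogger(__name__)

def run_feature_modules(feature_modules, pdb_path, graph):
    """Call each feature module, then verify it left no None values behind."""
    for module in feature_modules:
        module.add_features(pdb_path, graph)
        # inspect every node right after this module ran, so the culprit is obvious
        for node in graph.nodes:
            for name, value in node.features.items():
                if value is None:
                    raise ValueError(
                        f"module '{module.__name__}' left feature '{name}' "
                        f"unset in {pdb_path}"
                    )
        _log.debug("module '%s' completed without missing values", module.__name__)
```

Because the check runs immediately after each module, a custom module that forgets to handle a missing value is caught with its own name in the error, rather than failing later in the HDF5 writer.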

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.

DaniBodor commented 1 year ago

@gcroci2 , has this been addressed/solved yet?

gcroci2 commented 1 year ago

> @gcroci2 , has this been addressed/solved yet?

Nope. We can add a check for the HSE feature, whenever it's computed, and log a warning in case it's empty/None; then we need to default such cases to some HDF5-acceptable value, such as a negative integer (the HSE domain is always positive, right?)
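A sketch of that idea for the exposure module; the sentinel value, helper name, and lookup shape are assumptions, not the shipped behaviour:

```python
import logging

_log = logging.getLogger(__name__)

# HSE counts are non-negative, so a negative value is a safe "missing" sentinel
HSE_MISSING = -1.0

def hse_or_sentinel(hse_map: dict, residue_key) -> tuple:
    """Look up a residue's HSE triple, defaulting to sentinels with a warning."""
    if residue_key not in hse_map:
        _log.warning(
            "no HSE computed for residue %s; storing sentinel %s",
            residue_key, HSE_MISSING,
        )
        return (HSE_MISSING, HSE_MISSING, HSE_MISSING)
    return hse_map[residue_key]
```

A warning plus a valid numeric sentinel keeps the HDF5 write from failing while still telling the user which residues lacked HSE values.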

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity.