DeepRank / deeprank2

An open-source deep learning framework for data mining of protein-protein interfaces or single-residue variants.
https://deeprank2.readthedocs.io/en/latest/?badge=latest
Apache License 2.0
32 stars, 11 forks

Residue not found #613

Closed imerelli closed 1 month ago

imerelli commented 2 months ago

Hi, I'm trying the SRV tutorial, but I get an error with this command

>>> queries.process(
...     prefix=os.path.join(processed_data_path, "residue", "proc"),
...     feature_modules=[components, contact],
...     cpu_count=8,
...     combine_output=False,
...     grid_settings=grid_settings,
...     grid_map_method=grid_map_method,
... )

Graph/Query with ID residue-srv:A:6:Phenylalanine->Cysteine:pdb1tff ran into an Exception (ValueError: Residue not found in data_raw/srv/pdb/pdb1tff.ent: A 6), and it has not been written to the hdf5 file. More details below:
Residue not found in data_raw/srv/pdb/pdb1tff.ent: A 6
Traceback (most recent call last):
  File "/opt/tools/deg/deeprank2/deeprank2/query.py", line 477, in _process_one_query
    graph = query.build(self._feature_modules)
  File "/opt/tools/deg/deeprank2/deeprank2/query.py", line 202, in build
    graph = self._build_helper()
  File "/opt/tools/deg/deeprank2/deeprank2/query.py", line 290, in _build_helper
    raise ValueError(msg)
ValueError: Residue not found in data_raw/srv/pdb/pdb1tff.ent: A 6

Indeed, looking at the PDB file, I don't see residue A 6. Can you help me solve this? The documentation is not very detailed for the SRV part. I used deeprank-mut in the past, and I was wondering what the differences are with this new approach.

gcroci2 commented 2 months ago

Hi @imerelli, thank you for bringing this issue to our attention.

It appears that the problem you're encountering is related to the ENT input file (pdb1tff.ent), which is missing the A 6 residue. Our software handles this situation by catching the ValueError exception and notifying you about the issue without causing the code to fail. However, it means that the processed data point (in this case, a single residue variant graph created from pdb1tff.ent) is not being written to the HDF5 file as expected.
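To illustrate the catch-and-continue behaviour described above, here is a minimal sketch. It is not deeprank2's actual code: `build_graph`, `process_all`, and the `known_residues` set are hypothetical stand-ins, and only the shape of the error message is taken from the traceback in this issue.

```python
import logging

logger = logging.getLogger(__name__)


def build_graph(query, known_residues):
    """Hypothetical stand-in for a query's build step (not the deeprank2 API)."""
    pdb_path, chain_id, number = query
    if (chain_id, number) not in known_residues:
        msg = f"Residue not found in {pdb_path}: {chain_id} {number}"
        raise ValueError(msg)
    return {"query": query}


def process_all(queries, known_residues):
    """Build every query; log and skip those that raise ValueError."""
    written, skipped = [], []
    for query in queries:
        try:
            written.append(build_graph(query, known_residues))
        except ValueError as exc:
            # The failing query is reported but does not stop the run;
            # it is simply absent from the output.
            logger.warning("Query %s skipped: %s", query, exc)
            skipped.append(query)
    return written, skipped
```

In a loop like this, a missing residue costs one data point rather than aborting the whole run, which matches what the message in the original report says.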

This issue should not occur with the tutorials' provided data. To resolve this, I will investigate further to understand why this discrepancy is happening and determine if any adjustments are needed to the data or the software. I suspect it may be related to the input data itself rather than a software bug, but I'll verify this and get back to you in the coming weeks.

Please feel free to reach out if you have any other questions or concerns in the meantime.

imerelli commented 1 month ago

Hi, any news on this issue?

DaniBodor commented 1 month ago

Hi @imerelli.

We are still looking into it, but it is unclear why some of these are flagging errors at the moment.

For the sake of carrying on with the tutorial for now: the errors are displayed when running the tutorial, but you may have noticed that the cell does execute (there is a little green checkmark). This is because we catch the error and allow you to proceed nonetheless. For you, this means that you can carry on running the tutorial despite this issue, and there should not be any downstream problems. The only effect is that fewer data points are used for training, which does not really matter for the tutorial: the number of data points is already too low to gather real information, and it is intended for demonstration purposes only. If you use your own data, well curated so that no query refers to a residue that is absent from the structure file, you should not encounter these errors. If you do, please reach out again.
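As a way to check your own data before building queries, here is a minimal sketch that lists which (chain, residue number) pairs are present in a PDB-format file. It does not use deeprank2; `residues_in_pdb` and `has_residue` are hypothetical helper names, and the parsing relies only on the fixed-column layout of ATOM/HETATM records (chain ID in column 22, residue number in columns 23-26). Insertion codes are ignored for simplicity.

```python
def residues_in_pdb(lines):
    """Collect (chain_id, residue_number) pairs from ATOM/HETATM records."""
    found = set()
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if len(line) < 26:
            continue  # too short to hold chain ID and residue number
        chain_id = line[21]  # column 22 (1-based)
        try:
            number = int(line[22:26])  # columns 23-26 (1-based)
        except ValueError:
            continue  # malformed residue number field
        found.add((chain_id, number))
    return found


def has_residue(pdb_path, chain_id, number):
    """Return True if the file contains at least one atom of the residue."""
    with open(pdb_path) as handle:
        return (chain_id, number) in residues_in_pdb(handle)
```

For the case in this issue, one could call `has_residue("data_raw/srv/pdb/pdb1tff.ent", "A", 6)` and drop or fix any variant for which it returns `False` before adding the query.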

For us, it means that we should look further into why this is happening for our sample data, and that we should provide clearer user feedback when issues like this do happen. I have created a new issue (#635) for us specifically to improve this, which we will hopefully get to in the coming weeks (although the summer period may slow things down a bit).

Please let us know if this is clear and/or whether you are encountering any other issues with the notebook.

I will close this issue for now, but if there is anything that needs to be resolved, please reply here anyway and we can re-open it.