DeepRank / deeprank

This repository has been integrated in https://github.com/DeepRank/deeprank2
Apache License 2.0

cannot generate hdf5 for some complexes #147

Open LilySnow opened 4 years ago

LilySnow commented 4 years ago

code and data: /projects/0/deeprank/BM5/issue147

Create new database ./1ZM4.hdf5

Start creating HDF5 database: ./1ZM4.hdf5
Creating database : 0%| | 0/1 [00:00<?, ?it/s, mol=1ZM4_ranair-it0_9812.pdb]
Processing PDB file: ./decoys/1ZM4/1ZM4_ranair-it0_9812.pdb
/home/lixue1/tools/pdb2sql/pdb2sql/pdb2sqlcore.py:194: UserWarning: Missing chainID and set it with segID
  warnings.warn("Missing chainID and set it with segID")
Traceback (most recent call last):
  File "hdf5_generate.py", line 47, in <module>
    database.create_database(prog_bar=True)
  File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 439, in create_database
    self._rotate_feature(molgrp, axis, angle, mol_center)
  File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 1581, in _rotate_feature
    xyz = data[:, 1:4]
IndexError: too many indices for array
Creating database : 0%| | 0/1 [00:12<?, ?it/s, mol=1ZM4_ranair-it0_9812.pdb]

LilySnow commented 4 years ago

I tested your new fix. The hdf5 file can now be generated, but I got a warning:

/home/lixue1/tools/pdb2sql/pdb2sql/pdb2sqlcore.py:466: UserWarning: SQL query get an empty
  warnings.warn('SQL query get an empty')

Can we report what the warning is for? I remember you said this pdb is missing something, which means one of the features cannot be calculated, or something like that. Users may want to know about the problems in the pdb file.

LilySnow commented 4 years ago

I also tested this code version on issue 139 and got the same warning message: SQL query get an empty. A user may need to know what the problem is and how deeprank works around it.

NicoRenaud commented 4 years ago

Yeah, these warnings are quite common and not very helpful. I guess in this case it's because there is, for example, no polar-polar contact (or something along those lines).

The issue is that the warning is raised in pdb2sql, so it cannot know which feature was being computed at the time. You can always ignore the warning with python -W ignore

But maybe we should remove that warning. @CunliangGeng, what do you think?
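Note that python -W ignore hides every warning, not just this one. As a narrower alternative, here is a minimal sketch using only the standard warnings module to hide this single message. run_quietly and noisy_query are hypothetical names for illustration and are not part of pdb2sql or DeepRank.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func while hiding only the 'SQL query get an empty' UserWarning."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings(
            "ignore",
            message="SQL query get an empty",
            category=UserWarning,
        )
        return func(*args, **kwargs)


def noisy_query():
    # Hypothetical stand-in for a pdb2sql lookup that matches no atoms.
    warnings.warn("SQL query get an empty")
    return []


rows = run_quietly(noisy_query)  # no warning is shown; other warnings still are
```

The filter is restored when the with block exits, so warnings raised elsewhere in the script are unaffected.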

CunliangGeng commented 4 years ago

You're right, Nico, it's hard to provide more info in pdb2sql. I agree to remove that warning and check the return value of get in DeepRank if necessary.
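A rough sketch of what checking the return value on the caller's side could look like, so that the message can name both the feature and the pdb. The helper get_or_warn and its arguments are hypothetical, and the query callable merely stands in for a pdb2sql lookup; this is not DeepRank's actual code.

```python
import warnings


def get_or_warn(query, feature_name, pdb_name):
    """Run a query and warn with context if it returns nothing.

    query is any zero-argument callable returning a list. The caller knows
    which feature and which pdb it is working on, so it can say so.
    """
    rows = query()
    if len(rows) == 0:
        warnings.warn(
            f"{feature_name}: no matching atoms found in {pdb_name}; "
            "the feature will be stored empty"
        )
    return rows


# Example with a query that finds nothing.
rows = get_or_warn(lambda: [], "RCD_polar-polar", "1ZM4_ranair-it0_9812.pdb")
```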

NicoRenaud commented 4 years ago

that means we need a new release of pdb2sql just for that though ....

LilySnow commented 4 years ago

Sorry, but if you remove that warning, will the user know when there is a problem with the pdb?

I thought we need to report a warning from deeprank something like "CB is not available in xxx.pdb. We are using center of residue for pssm mapping"

LilySnow commented 4 years ago

For this specific pdb file, which has no polar-polar interaction, what does deeprank do?

Ok. I saw:

list(f['1ZM4_ranair-it0_9812/features_raw/RCD_polar-polar_raw'])
[]

Maybe we should ask deeprank to 1) report polar-polar interactions are not able to be calculated and 2) remove such pdb files from the hdf5 file (because an empty feature may cause issues in the later deep learning training)?

NicoRenaud commented 4 years ago

> Sorry, but if you remove that warning, will the user know when there is a problem with the pdb?
>
> I thought we need to report a warning from deeprank something like "CB is not available in xxx.pdb. We are using center of residue for pssm mapping"

It's not necessarily a problem of the PDB that leads to this warning. There can be no polar-polar (for example) contact even if the two chains are in contact.

All features use as center :

NicoRenaud commented 4 years ago

> For this specific pdb file, which has no polar-polar interaction, what does deeprank do?
>
> Ok. I saw:
>
> list(f['1ZM4_ranair-it0_9812/features_raw/RCD_polar-polar_raw'])
> []
>
> Maybe we should ask deeprank to 1) report polar-polar interactions are not able to be calculated and 2) remove such pdb files from the hdf5 file (because an empty feature may cause issues in the later deep learning training)?

That won't lead to any issue in the training; the grid will simply be empty.

NicoRenaud commented 4 years ago

I think we will always find some PDB for which the data generation will fail. We should try to cover most cases but covering them all is unrealistic (especially with the amount of hours we have left :( )

Maybe we should add a best practice for the data quality, i.e. a description of how to prepare data for DeepRank to be able to exploit them.

LilySnow commented 4 years ago

I agree that it is not realistic to cover all cases. Can we print a warning that says which feature, for which pdb, caused "SQL query get an empty"?

When the polar-polar grid is empty for one pdb and other pdbs have polar-polar grids, what will the DL do?

NicoRenaud commented 4 years ago

No, unfortunately we cannot tell which feature caused the warning. We could say which pdb did, though.
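One way to report which pdb triggered a warning, sketched with the standard warnings module only: record the warnings raised while one pdb is processed, then re-emit them with the pdb name prepended. process_with_pdb_context and the process callable are hypothetical names, not DeepRank functions.

```python
import warnings


def process_with_pdb_context(pdb_name, process):
    """Run process(pdb_name) and re-emit its warnings prefixed with the pdb name."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including repeats of the same message.
        warnings.simplefilter("always")
        result = process(pdb_name)
    # Outside the block the original filters apply again, so the re-emitted
    # warnings are shown (or filtered) as the user configured.
    for w in caught:
        warnings.warn(f"[{pdb_name}] {w.message}", category=w.category)
    return result
```

The feature name is still unknown at this level, which matches the limitation described above.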

By empty I meant that the grid value is 0 for each voxel. So it won't affect the training.

sonjageorgievska commented 4 years ago

I just stumbled into this conversation and don't know the full context, but just to be sure: are "empty" grids (all 0s) excluded from training? Is there already filtering in place? I wonder about "it won't affect the training".

LilySnow commented 4 years ago

Hi Sonja, when an interface has no polar-polar interactions (one feature), Nico said it will generate a grid for this feature with zeros.

Hi Nico, I checked the grid values. They are not zeros. It is an empty list:

In [6]: list(f['1ZM4_ranair-it0_9812/features/RCD_polar-polar'])
Out[6]: []

And this:

In [12]: list(f['1ZM4_ranair-it0_9812/mapped_features/Feature_ind/RCD_polar-polar_chainA/value'])
Out[12]: []

NicoRenaud commented 4 years ago

hmmm that's strange. I'll check tomorrow

NicoRenaud commented 4 years ago

Yes, it's because we store the mapped features in a sparse representation to save storage space (I forgot about that). So an empty grid has no indices and no values. It should work during the training :)
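To illustrate why an empty sparse entry is equivalent to an all-zero grid, here is a simplified sketch that expands flat indices and values into a dense list. The layout (a flat index list plus a value list) and the name sparse_to_dense are assumptions for illustration; DeepRank's actual on-disk format may differ.

```python
def sparse_to_dense(index, value, n_voxels):
    """Expand a sparse grid (flat voxel indices + values) into a dense list."""
    dense = [0.0] * n_voxels          # every voxel starts at zero
    for i, v in zip(index, value):    # only the stored voxels are overwritten
        dense[i] = v
    return dense


# No stored indices and no stored values give a grid of zeros.
empty_grid = sparse_to_dense([], [], 8)
some_grid = sparse_to_dense([1, 5], [0.5, 2.0], 6)
```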

LilySnow commented 4 years ago

Perfect :) I will regenerate the data then.