DeepRank / DeepRank-Mut

Deep learning framework to predict functional effects of missense variants in human
Apache License 2.0

testing DeepRank-Mut #33

Open imerelli opened 11 months ago

imerelli commented 11 months ago

Hi, I'm trying to use DeepRank-Mut. The first problem is that I can't work out how to run the tests: the documentation says to enter the test directory and run pytest, but this command does not work.

However, from the root directory I can run the test scripts that are in the test directory. Here is the output. While test/test_tools.py and test/test_atomic_features.py produce output, test/test_generate.py and test/test_learn.py do not. Is this expected? Is there something that I should do differently?

( deeprank ) $ python test/test_atomic_features.py 
test/test_atomic_features.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/pdb2sql/pdb2sqlcore.py:263: UserWarning: Missing chainID and set it with segID
  warnings.warn("Missing chainID and set it with segID")
AtomicFeature coulomb and vdw exported to file ./atomic_pair_interaction.dat
.
----------------------------------------------------------------------
Ran 1 test in 1.645s

OK

( deeprank ) $ python test/test_generate.py 

( deeprank ) $ python test/test_learn.py 

( deeprank ) $ python test/test_tools.py 
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:109: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  assert len(resA[:, 0].astype(np.int).tolist()) == len(
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:110: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.unique(resA[:, 0].astype(np.int)).tolist())
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:111: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  assert len(resB[:, 0].astype(np.int).tolist()) == len(
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:112: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.unique(resB[:, 0].astype(np.int)).tolist())
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:115: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  self.xyz[chain1] = resA[:, 2:].astype(np.float)
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:116: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  self.xyz[chain2] = resB[:, 2:].astype(np.float)
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:61: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  resSeqA = np.unique(resA[:, 0].astype(np.int))
/opt/tools/deg/DeepRank-mut/deeprank/tools/sasa.py:62: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  resSeqB = np.unique(resB[:, 0].astype(np.int))
.
----------------------------------------------------------------------
Ran 1 test in 0.412s

OK
cbaakman commented 11 months ago

Hello, and thanks for showing interest in DeepRank-Mut. It's true that the tests should be run from the root and not from the test directory. Thanks for pointing this out; I'll see if I can change this in the documentation. If the tests do not show any output, that means that they ran fine. I see you got some warnings because our code was written against an older version of NumPy. I'll see if I can make those warnings go away.

imerelli commented 11 months ago

Thank you for your help. If I have understood the workflow of how to use your tool correctly, I need three scripts: 1) one for creating the database with multiple variants that I already know to be pathogenic or benign, 2) one to train the network, and 3) one to make predictions with a novel, unknown variant. Is this correct? If yes, could you please provide a template for these three scripts? There is something about the first two (but with only one variant), but nothing for step 3. Sorry, I'm not a Python programmer and I need help with this.

cbaakman commented 11 months ago

To design a script, let me start by asking how your data is organized. Deeprank-mut requires:

imerelli commented 11 months ago

deeprank.zip

I have all the requested information. Attached is a zip archive with the PDB and PSSM files, as well as a tentative script for the first two steps (it contains the information about the known mutations, but I don't know how to proceed with more than one mutation). I have no idea how to write the third script.

cbaakman commented 11 months ago

That's an odd looking PSSM. How does one read this? The PSSM format that deeprank works with looks like this:

pdbresi pdbresn seqresi seqresn    A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V   IC
      0       M       1       M   -2   -2   -3   -4   -2   -1   -3   -4   -3    0    1   -2    9   -1   -4   -2   -2   -2   -2    0 1.05
      1       V       2       V    1   -2   -1   -2   -2   -1   -1    5   -2   -3   -3   -2   -2   -3   -2    1    1   -3   -3   -2 0.64
      2       L       3       L   -3   -3   -5   -5   -2   -3   -4   -5   -4    1    6   -4    1    0   -4   -4   -2   -3   -2    0 0.94
      3       S       4       S    0   -2    2   -1   -2   -1   -1   -1   -2   -3   -3   -1    0   -3   -2    5    2   -4   -3   -2 0.69
      4       E       5       E   -2   -2    0    6   -4    0    4   -1   -2   -4   -5   -1   -4   -5   -2    0   -2   -5   -4   -4 0.98
      5       G       6       G    4   -1   -1   -1   -2    1    0    3   -2   -3   -3    0   -2   -4   -2    1   -1   -4   -3   -1 0.41
      6       E       7       E   -3   -2    1    5   -5    2    5   -3   -2   -5   -5   -1   -4   -5   -3   -2   -2   -5   -4   -4 1.05
      7       W       8       W   -4    1   -4   -5   -4   -1   -4   -4    1   -3   -2   -2   -3    4   -5   -4   -4   10    3   -2 1.59
      8       Q       9       Q   -2    0    1    4   -4    5    2   -3   -1   -3   -4    1   -3   -4   -3   -1   -2   -4   -3   -1 0.67
      9       L      10       L    0    0   -1   -3   -2    0   -1   -3    1    0    2    1    3   -1   -2   -1    1   -3   -2    0 0.15
imerelli commented 11 months ago

It's in JSON format instead of a matrix. We will convert it if necessary, but it is not mandatory, right?

cbaakman commented 11 months ago

I think you need to convert it. Sorry!

cbaakman commented 11 months ago

I've looked through your generate.py script. You can use it for preprocessing known variant data (1) just as well as for preprocessing unknown variant data (3), with only slight modifications for (3):

A) Leave out the variant_class argument when instantiating a PdbVariantSelection object. It's optional.

B) Set compute_targets=[] in the DataGenerator object. This makes sure that the preprocessing won't look for a class value to store.

imerelli commented 11 months ago

1) Concerning the PSSM matrix, I obtained it using PSI-BLAST at NCBI. How do you usually compute this matrix? I can write a script for the conversion, but it will be quite tricky.

2) Before getting to part (3), our problem is how to put more than one variant into the database in step (1). I don't understand the Python syntax that I should use.

cbaakman commented 11 months ago
  1. We used PSIBLAST too to compute this matrix, but the PSSM data needs to be mapped to the PDB file. For that, we used a tool called PSSMgen: https://github.com/DeepRank/pssmgen

  2. You need a CSV table file containing your variant data: PDB ID, residue number, amino acid. The script can then use the pandas library to extract the data from that CSV file and put it in PdbVariantSelection objects.
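A minimal sketch of the CSV step, written with the standard library only so that it runs without DeepRank-Mut installed (pandas.read_csv would serve the same purpose). The WILDTYPE column name appears in a traceback later in this thread; PDBID, CHAIN, RESIDUE, VARIANT and CLASS are assumed column names, the class label in the sample data is made up for illustration, and VariantRow is a hypothetical stand-in for PdbVariantSelection, whose real constructor should be taken from the readme.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariantRow:
    """Hypothetical stand-in for PdbVariantSelection (not the real class)."""
    pdb_id: str
    chain_id: str
    residue_number: int
    wildtype: str
    variant: str
    variant_class: Optional[str]  # None for variants of unknown effect


def read_variants(handle) -> List[VariantRow]:
    """Turn each CSV row into one VariantRow."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append(VariantRow(
            pdb_id=record["PDBID"],
            chain_id=record["CHAIN"],
            residue_number=int(record["RESIDUE"]),
            wildtype=record["WILDTYPE"],
            variant=record["VARIANT"],
            # An empty or missing CLASS column means "no known label".
            variant_class=record.get("CLASS") or None,
        ))
    return rows


# Illustrative table: the column names and the class label are assumptions.
example_csv = io.StringIO(
    "PDBID,CHAIN,RESIDUE,WILDTYPE,VARIANT,CLASS\n"
    "1CR4,A,200,GLN,ALA,pathogenic\n"
    "1CR4,A,171,ASP,ASN,\n"
)
variants = read_variants(example_csv)
```

Each VariantRow would then be converted into one PdbVariantSelection. Leaving the class empty for a row corresponds to omitting variant_class for a variant of unknown effect.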

imerelli commented 11 months ago

Hi, thank you for providing the 3 scripts. I created the table.csv and the generate.py. Here is the error I'm getting now:

python generate.py 
Traceback (most recent call last):
  File "generate.py", line 5, in <module>
    from deeprank.generate import *
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/__init__.py", line 1, in <module>
    from .DataGenerator import DataGenerator
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 17, in <module>
    from deeprank.generate import GridTools as gt
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/GridTools.py", line 13, in <module>
    from deeprank.operate import hdf5data
  File "/opt/tools/deg/DeepRank-Mut/deeprank/operate/hdf5data.py", line 7, in <module>
    from deeprank.domain.amino_acid import amino_acids
  File "/opt/tools/deg/DeepRank-Mut/deeprank/domain/amino_acid.py", line 52
    amino_acids_by_code {
                        ^
SyntaxError: invalid syntax

Also, two other minor points:

cbaakman commented 11 months ago

Sorry about the syntax error. I fixed it. I wouldn't recommend the developmental branch.

I'm trying to fix the pytest script, but it will take some time. Sorry, currently things are quite busy on our side.

imerelli commented 11 months ago

Sorry, there are still some errors.

python generate.py
Traceback (most recent call last):
  File "generate.py", line 22, in <module>
    wildtype_amino_acid = amino_acids_by_code[row["WILDTYPE"]]
TypeError: 'set' object is not subscriptable

I don't see how to solve it.

There is also an indentation error here:

  File "generate.py", line 53
    grid_info = {
    ^
IndentationError: unexpected indent

But that was easy to solve.

cbaakman commented 11 months ago

Indeed. I pushed the fix. Sorry!
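For readers who hit the same TypeError: in Python, braces that hold bare values create a set, while braces that hold key: value pairs create a dict, and only the dict supports lookup by key. The sketch below reproduces the message with hypothetical entries; it is not the actual content of deeprank/domain/amino_acid.py, nor necessarily the fix that was pushed.

```python
# Hypothetical entries, for illustration only.
as_set = {"ALA", "GLN"}                            # bare values: a set
as_dict = {"ALA": "Alanine", "GLN": "Glutamine"}   # key: value pairs: a dict


def lookup(table, code):
    """Return table[code], or the error text if the table cannot be indexed."""
    try:
        return table[code]
    except TypeError as error:
        return f"TypeError: {error}"


set_result = lookup(as_set, "GLN")    # the set cannot be indexed by key
dict_result = lookup(as_dict, "GLN")  # the dict returns the mapped value
```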

imerelli commented 11 months ago

Ok, now it's running. But I got this error.

803, 'O', 'O', 'A', 352, 'SER', '', '', 1.0], [-32.719, 25.157, 60.035, 2804, 'OG', 'O', 'A', 352, 'SER', '', '', 1.0], [-29.576, 22.331, 60.691, 2805, 'OXT', 'O', 'A', 352, 'SER', '', '', 1.0]] for x,y,z,rowID,name,element,chainID,resSeq,resName,iCode,altLoc,occ  
Creating database        : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:23<00:00,  1.93s/it, variant=1CR4.pdb]
Molecules with errored features are removed:
['1CR4:A:200:Glutamine->Alanine-m7857361863297430751', '1CR4:A:281:Histidine->Alanine-m3856591489522861677', '1CR4:A:284:Isoleucine->Alanine-m6427764906900908035', '1CR4:A:259:Isoleucine->Alanine-m6594343748205688219', '1CR4:A:113:Histidine->Alanine-m8237672030338117968', '1CR4:A:203:Histidine->Alanine-7177783304471683210', '1CR4:A:171:Aspartate->Asparagine-614671715635415912', '1CR4:A:175:Alanine->Phenylalanine-7912783878010190788', '1CR4:A:187:Aspartate->Alanine-3861484985167401279', '1CR4:A:255:Tyrosine->Alanine-9130804594370861896', '1CR4:A:262:Aspartate->Asparagine-m4545163254089794910', '1CR4:A:288:Glutamate->Alanine-m5086511475306243927']

# Successfully created database: train_data.hdf5

Traceback (most recent call last):
  File "generate.py", line 59, in <module>
    database.map_features(grid_info,try_sparse=True, time=False, prog_bar=True)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 1015, in map_features
    raise ValueError(f'No variants found in {self.hdf5}.')
ValueError: No variants found in train_data.hdf5.

I don't understand why my molecules are removed. The substitutions that I listed in table.csv are valid.

cbaakman commented 11 months ago

Were any logs created?

If not, then add this to the beginning of your script:

import logging

logging.basicConfig(filename="deeprank-mut.log", level=logging.DEBUG)

This should output the errors that cause your variants to be skipped.
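Once the log exists, a short script can summarise which feature failed for which variant. This is a sketch that assumes the default logging.basicConfig line format (LEVEL:logger:message) and the "Error while computing ... for ...:" wording seen in the log excerpts in this thread; failed_features is a hypothetical helper, not part of DeepRank-Mut.

```python
import re
from collections import defaultdict

# Matches lines such as:
# ERROR:deeprank:Error while computing <feature module> for <variant>: Traceback ...
ERROR_LINE = re.compile(r"^ERROR:[^:]*:Error while computing (\S+) for (\S+?): ")


def failed_features(log_lines):
    """Map each variant to the list of feature modules that raised an error."""
    failures = defaultdict(list)
    for line in log_lines:
        match = ERROR_LINE.match(line)
        if match:
            feature, variant = match.groups()
            failures[variant].append(feature)
    return dict(failures)


sample_log = [
    "INFO:deeprank:",
    "Processing variant: 1CR4:A:200:Glutamine->Alanine-7574597180198206144",
    "ERROR:deeprank:Error while computing deeprank.features.atomic_contacts "
    "for 1CR4:A:200:Glutamine->Alanine: Traceback (most recent call last):",
]
summary = failed_features(sample_log)
```

For a real run, the lines would come from the file named in basicConfig, for example by passing open("deeprank-mut.log") to failed_features.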

imerelli commented 11 months ago
DEBUG:h5py._conv:Creating converter from 5 to 3
INFO:deeprank:
# Start creating HDF5 database: train_data.hdf5
INFO:deeprank:
Processing variant: 1CR4:A:200:Glutamine->Alanine-7574597180198206144
ERROR:deeprank:Error while computing deeprank.features.atomic_contacts for 1CR4:A:200:Glutamine->Alanine: Traceback (most recent call last):
  File "/opt/tools/deg/DeepRank-Mut/deeprank/operate/pdb.py", line 64, in get_atoms
    x, y, z, atom_number, atom_name, element, chain_id, residue_number, residue_name, insertion_code, altloc, occ = row
ValueError: too many values to unpack (expected 12)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 1457, in _compute_features
    feat_module.__compute_feature__(environment, distance_cutoff, featgrp, variant)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/atomic_contacts.py", line 96, in __compute_feature__
    atoms = _get_atoms_around_variant(environment, variant)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/atomic_contacts.py", line 60, in _get_atoms_around_variant
    for atom1, atom2 in get_residue_contact_atom_pairs(pdb,
  File "/opt/tools/deg/DeepRank-Mut/deeprank/operate/pdb.py", line 115, in get_residue_contact_atom_pairs
    atoms = get_atoms(pdb2sql)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/operate/pdb.py", line 66, in get_atoms
    raise ValueError("Got unexpected row {} for {}".format(row, request_s))
ValueError: Got unexpected row [[27.976, 1.967, -10.746, 0, 'N', 'N', 'A', 1, 'MET', '', '', 1.0], [28.961, 0.908, -11.035, 1, 'CA', 'C', 'A', 1, 'MET', '', '', 1.0],...
cbaakman commented 11 months ago

This suggests that something might have changed in the output of pdb2sql. But I don't see this happening at my end. Could you run:

pytest test/operate/test_pdb.py
imerelli commented 11 months ago

I'm not able to use pytest. I'm attaching the PDB file, the table file and the generate.py script; maybe you can test them: Archive.zip

(deeprank) [imerelli@slurmlogin DeepRank-Mut]$ pytest test/operate/test_pdb.py
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov --cov-report --cov-report term --cov-report html test/operate/test_pdb.py
  inifile: /opt/tools/deg/DeepRank-Mut/setup.cfg
  rootdir: /opt/tools/deg/DeepRank-Mut

(deeprank) [imerelli@slurmlogin DeepRank-Mut]$ pytest
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov --cov-report --cov-report term --cov-report html
  inifile: /opt/tools/deg/DeepRank-Mut/setup.cfg
  rootdir: /opt/tools/deg/DeepRank-Mut
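The --cov options are provided by the pytest-cov plugin, so "unrecognized arguments: --cov" most likely means that the configuration file passes coverage options to pytest while the plugin is not installed in this environment. A small check, assuming the plugin's importable module name is pytest_cov (missing_pytest_plugins is a hypothetical helper):

```python
import importlib.util


def missing_pytest_plugins(required=("pytest_cov",)):
    """Return the names of the required plugin modules that cannot be imported."""
    return [name for name in required
            if importlib.util.find_spec(name) is None]


missing = missing_pytest_plugins()
if missing:
    print("Not installed in this environment:", ", ".join(missing))
```

If the plugin turns out to be missing, installing pytest-cov into the environment, or running pytest -o addopts="" to ignore the options configured in setup.cfg, are the usual ways around this error.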

cbaakman commented 11 months ago

So apparently pdb2sql behaves differently on some PDB files. I made a unit test and a fix for it. I hope this helps you.

imerelli commented 11 months ago

OK, the loading of the PDB file seems fixed. Now I have problems with the PSSM matrix. I was able to obtain the matrix in the format required by your software using PSI-BLAST and a Python conversion script (I wasn't able to use https://github.com/DeepRank/pssmgen), but now I have this problem:

Creating database        :  92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍         | 11/12 [00:14<00:00,  1.73it/s, variant=1CR4.pdb]
Processing variant: 1CR4:A:288:Glutamate->Alanine-m8091716489555719240
ERROR: Error while computing deeprank.features.neighbour_profile for 1CR4:A:288:Glutamate->Alanine: Traceback (most recent call last):
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 1457, in _compute_features
    feat_module.__compute_feature__(environment, distance_cutoff, featgrp, variant)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 100, in __compute_feature__
    pssm = _get_pssm(chain_ids, variant, environment)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 83, in _get_pssm
    raise FileNotFoundError("No PSSM for {} chain {} in {}".format(variant.pdb_ac, chain_id, environment.pssm_root))
FileNotFoundError: No PSSM for 1CR4 chain A in /opt/tools/deg/DeepRank-Mut
Traceback (most recent call last):
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 1457, in _compute_features
    feat_module.__compute_feature__(environment, distance_cutoff, featgrp, variant)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 100, in __compute_feature__
    pssm = _get_pssm(chain_ids, variant, environment)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 83, in _get_pssm
    raise FileNotFoundError("No PSSM for {} chain {} in {}".format(variant.pdb_ac, chain_id, environment.pssm_root))
FileNotFoundError: No PSSM for 1CR4 chain A in /opt/tools/deg/DeepRank-Mut

The PDB and PSSM files are attached: Archive.zip

cbaakman commented 11 months ago

It expects the chain id in the pssm filename. Like: /opt/tools/deg/DeepRank-Mut/1cr4.A.pdb.pssm
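The naming pattern in that example can be written down as a small helper. This is a sketch: expected_pssm_path is a hypothetical function, and lower-casing the accession simply mirrors the example path above; whether DeepRank-Mut also accepts other spellings is not established here.

```python
from pathlib import PurePosixPath


def expected_pssm_path(pssm_root, pdb_ac, chain_id):
    """Build <pssm_root>/<pdb accession>.<chain id>.pdb.pssm."""
    # Lower-casing the accession mirrors the example path; it is an assumption.
    return PurePosixPath(pssm_root) / f"{pdb_ac.lower()}.{chain_id}.pdb.pssm"


path = expected_pssm_path("/opt/tools/deg/DeepRank-Mut", "1CR4", "A")
```

Comparing str(path) with the files actually present in the PSSM directory shows quickly whether a file still needs to be renamed.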

imerelli commented 11 months ago

Ok, now the pssm file is read, but there are still errors in parsing it

Creating database        :  92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍         | 11/12 [00:13<00:00,  1.75it/s, variant=1CR4.pdb]
Processing variant: 1CR4:A:288:Glutamate->Alanine-2980880347696195539
ERROR: Error while computing deeprank.features.neighbour_profile for 1CR4:A:288:Glutamate->Alanine: Traceback (most recent call last):                                                              
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 1457, in _compute_features                                                                                            
    feat_module.__compute_feature__(environment, distance_cutoff, featgrp, variant)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 100, in __compute_feature__                                                                                       
    pssm = _get_pssm(chain_ids, variant, environment)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 86, in _get_pssm                                                                                                  
    pssm.merge_with(parse_pssm(f, chain_id))
  File "/opt/tools/deg/DeepRank-Mut/deeprank/parse/pssm.py", line 47, in parse_pssm
    return parse_old_pssm(file_)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/parse/pssm.py", line 14, in parse_old_pssm
    pssm.set_amino_acid_value(residue, code, float(value))
ValueError: could not convert string to float: 'seqresn'
Traceback (most recent call last):
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 1457, in _compute_features                                                                                            
    feat_module.__compute_feature__(environment, distance_cutoff, featgrp, variant)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 100, in __compute_feature__                                                                                       
    pssm = _get_pssm(chain_ids, variant, environment)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/features/neighbour_profile.py", line 86, in _get_pssm                                                                                                  
    pssm.merge_with(parse_pssm(f, chain_id))
  File "/opt/tools/deg/DeepRank-Mut/deeprank/parse/pssm.py", line 47, in parse_pssm
    return parse_old_pssm(file_)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/parse/pssm.py", line 14, in parse_old_pssm
    pssm.set_amino_acid_value(residue, code, float(value))
ValueError: could not convert string to float: 'seqresn'

The pssm file is the one attached before, it has the following structure, which seems identical to the one you suggested:

dbresi  pdbresn seqresi seqresn A       R       N       D       C       Q       E       G       H       I       L       K       M       F       P       S       T       W       Y       V       IC
1       D       1       D       -3      -2      2       7       -4      -1      1       -2      -2      -4      -5      -2      -4      -5      -2      -1      -2      -5      -4      -4      1.00
2       Y       2       Y       -2      -2      0       -3      -3      -2      -2      -1      0       -2      -2      -2      -2      3       -3      0       0       1       7       -2      1.00
3       H       3       H       -1      0       -1      1       -3      3       1       -3      4       -2      -2      2       0       -1      -2      0       0       -2      3       -1      1.00
4       E       4       E       -2      -2      1       3       -4      0       5       -3      -2      -3      -4      -1      -3      -4      4       -1      -2      -4      -1      -1      1.00
5       D       5       D       -2      -2      -1      6       -4      -1      2       -3      -2      -4      -2      -2      -3      -3      1       1       0       -4      0       -3      1.00
6       Y       6       Y       -2      -2      1       1       -3      -2      -2      -3      2       -2      -1      -2      0       1       -1      1       0       -1      6       -2      1.00
7       G       7       G       0       -1      2       1       -3      -1      0       4       -2      -2      -3      -1      -3      -3      1       0       -1      -3      0       -1      1.00
8       F       8       F       1       -1      0       -2      0       -1      -1      -1      1       0       0       -1      1       2       -2      0       1       -2      1       -1      1.00
9       S       9       S       1       -2      2       -2      -2      -1      -2      1       -1      -1      -1      -2      1       2       -1      1       1       -2      2       0       1.00
10      S       10      S       1       -2      2       2       -2      -1      0       1       -2      -2      -1      -1      -2      -2      -2      2       1       -3      -1      0       1.00
11      F       11      F       0       -1      0       -1      -2      -2      -2      -1      -1      0       0       -2      1       3       -1      1       1       2       3       0       1.00
12      N       12      N       -2      -2      5       2       -1      0       0       1       0       -4      -4      0       -3      -3      0       1       -1      -3      1       -4      1.00
13      D       13      D       -1      -2      3       4       -4      -1      2       -1      1       -3      -2      0       -3      -3      -2      0       0       -4      0       -3      1.00
14      S       14      S       2       -2      0       0       2       0       -1      0       -2      -3      -3      -2      -2      -1      0       3       2       -3      1       -2      1.00
15      S       15      S       0       -1      0       0       2       -1      0       0       1       -2      -2      -1      -1      -3      1       3       2       1       0       -1      1.00
cbaakman commented 11 months ago

Your first line says dbresi, but it should be pdbresi.
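A header typo like this is easy to miss by eye, so a converted file can be checked before it is handed to DeepRank-Mut. The sketch below assumes a whitespace-separated table with the column names from the format shown earlier in this thread (starting with pdbresi); check_pssm is a hypothetical helper, not the parser in deeprank/parse/pssm.py, and the all-zero scores are placeholders.

```python
import io

# Column names taken from the format shown earlier in this thread.
EXPECTED_HEADER = (["pdbresi", "pdbresn", "seqresi", "seqresn"]
                   + list("ARNDCQEGHILKMFPSTWYV") + ["IC"])


def check_pssm(handle):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    header = handle.readline().split()
    if len(header) != len(EXPECTED_HEADER):
        problems.append(
            f"header has {len(header)} columns, expected {len(EXPECTED_HEADER)}")
    for position, (found, wanted) in enumerate(zip(header, EXPECTED_HEADER), start=1):
        if found != wanted:
            problems.append(
                f"header column {position}: found {found!r}, expected {wanted!r}")
    for line_number, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(EXPECTED_HEADER):
            problems.append(f"line {line_number}: {len(fields)} fields")
            continue
        try:
            # Every column after the four residue columns must be numeric.
            [float(value) for value in fields[4:]]
        except ValueError:
            problems.append(f"line {line_number}: non-numeric score")
    return problems


placeholder_scores = " ".join(["0"] * 20) + " 1.00"
bad_file = io.StringIO(
    "dbresi pdbresn seqresi seqresn " + " ".join(EXPECTED_HEADER[4:]) + "\n"
    + "1 D 1 D " + placeholder_scores + "\n"
)
problems = check_pssm(bad_file)
```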

imerelli commented 11 months ago

Ok, thank you. Unfortunately, I have another error:

Creating database        :   0%|                                                                                                                                                                                                 | 0/12 [00:10<?, ?it/s, variant=1CR4.pdb]
Traceback (most recent call last):
  File "generate.py", line 51, in <module>
    database.create_database(prog_bar=True)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 390, in create_database
    rotation_center = self._add_aug_pdb(variant_group, variant,
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/DataGenerator.py", line 1624, in _add_aug_pdb
    pdb2sql.transform.rot_axis(sqldb, axis, angle)
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/pdb2sql/transform.py", line 43, in rot_axis
    xyz = rot_xyz_around_axis(xyz, axis, angle)
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/pdb2sql/transform.py", line 106, in rot_xyz_around_axis
    return rotate(xyz, rot_mat, center)
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/pdb2sql/transform.py", line 198, in rotate
    return np.dot(rot_mat, (xyz - center).T).T + center
  File "<__array_function__ internals>", line 180, in dot
ValueError: shapes (3,3) and (3,2806,1) not aligned: 3 (dim 1) != 2806 (dim 1)
cbaakman commented 11 months ago

OK, pdb2sql seems to have a problem with your PDB file. I've done some tests. Removing the ENDMDL line from the PDB file fixes the problem.
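The workaround can be scripted rather than done by hand. A minimal sketch that drops ENDMDL records and leaves every other line untouched; strip_endmdl and the placeholder lines are illustrative only, and it is safest to write the result to a new file and keep the original.

```python
def strip_endmdl(lines):
    """Drop ENDMDL records and keep every other line unchanged."""
    return [line for line in lines if not line.startswith("ENDMDL")]


# Placeholder content, not the attached 1CR4 file.
example_lines = [
    "MODEL        1\n",
    "ATOM  (atom record placeholder)\n",
    "ENDMDL\n",
    "END\n",
]
cleaned = strip_endmdl(example_lines)
```

For a real file, the lines would be read with open(...).readlines() and the cleaned list written out with writelines() under a different file name.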

Where did you get this PDB file? The pdb's original 1CR4 file looks different.

imerelli commented 11 months ago

Thank you. It works. Now I have moved on to the second script:

$ python test_learn.py 

========================================
=        DeepRank Data Set
=
=        Training data
=        -> train_data.hdf5
=
=
=
========================================

   Checking dataset Integrity

   Processing data set:
   Train dataset
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547 from train_data.hdf5

   Data Set Info:
   Augmentation       : 10 rotations
   Training set       : 132 conformations
   Validation set     : 0 conformations
   Test set           : 0 conformations
   Number of channels : 31
   Grid Size          : 30, 30, 30
Traceback (most recent call last):
  File "test_learn.py", line 25, in <module>
    model = NeuralNet(data_set,cnn_reg,model_type='3d',task='reg',
TypeError: __init__() got an unexpected keyword argument 'plot'
rgayatri commented 11 months ago

Dear Ivan Merelli,

We highly appreciate your continued interest in getting DeepRank-Mut to work for your data. Thank you for using our package. We must add that this software is better suited to those who have somewhat advanced knowledge of Python. The tool requires several things to be prepared before it runs smoothly. This is not to discourage you from raising issues on GitHub; we are happy to address them. It is just a gentle nudge that we encourage seeking help mainly for (maybe) critical issues.

For instance, the learning task at hand is classification rather than regression. Hence, the correct call would be

neural_net = NeuralNet(dataset, cnn_class, model_type='3d',task='class',cuda=False, metrics_exporters=[OutputExporter(run_directory), TensorboardBinaryClassificationExporter(run_directory)])

The error is thrown because the classification task uses TensorBoard for plots. The argument 'plot' is from the master version of DeepRank for protein complexes. To view plots after training, you would need to install tensorboard.

Also, I notice you do not have validation or test datasets; you can divide your training data if you wish by specifying the following in the learn.py script: divide_trainset= [0.8, 0.2]
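To make the meaning of the two fractions concrete, here is a generic 80/20 split of item indices. It is only an illustration of the idea; DeepRank-Mut's own implementation may shuffle, round or group the data differently, and split_indices is a hypothetical helper.

```python
import random


def split_indices(n_items, fractions=(0.8, 0.2), seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into two parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = int(n_items * fractions[0])     # first fraction, rounded down
    return indices[:n_train], indices[n_train:]


# 132 is the number of conformations reported in the output above.
train, valid = split_indices(132)
```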

You also have the option of feeding your own validation and test datasets.

I would recommend going through the code in DataSet.py, NeuralNet.py and model3d.py to see what options would work best for your learning task.

Thanks again for your interest.

imerelli commented 11 months ago

Dear Gayatri Ramakrishnan,

Thank you for your help. Your tool is very interesting, but objectively difficult to use. Essentially, there is still no working example in the repository. Another problem is that there is not much documentation and the explanation contains some inaccuracies, such as that pytest is not usable and that the PSSM matrix is actually mandatory, while it is listed as optional.

I can't go through all your code to understand how it works; it was already very complicated to create the PSSM matrix. I just want to be able to model mutations in my protein, and we are trying to do this, so I thank you. I think that setting up a working example could also be useful for you. Once this working example is complete, you can certainly use it for documentation, so this work is important for everyone.

That said, I did not understand your suggestion. Perhaps neural_net should be model? Should dataset be data_set? Do I need to import cnn_class? And yet it still tells me

NameError: name 'OutputExporter' is not defined. 

Also, I wouldn't know where to insert divide_trainset= [0.8, 0.2], since that variable does not exist. Once the database is created with the first script, could you please provide me with a script that trains the network in the simplest way possible?

cbaakman commented 11 months ago

Actually, PSSM is optional. But if you omit it, then you must remove 'deeprank.features.neighbour_profile' from the feature list. This feature is computed from PSSM.
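As a sketch of that adjustment: both module names below appear in tracebacks in this thread, while the variable names and the helper are hypothetical. How the list is handed to the DataGenerator should be taken from the readme.

```python
PSSM_FEATURE = "deeprank.features.neighbour_profile"


def without_pssm_features(feature_modules):
    """Return the feature list without the feature that needs PSSM files."""
    return [name for name in feature_modules if name != PSSM_FEATURE]


features = without_pssm_features([
    "deeprank.features.atomic_contacts",
    "deeprank.features.neighbour_profile",
])
```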

The OutputExporter problem can be solved by importing it, as shown in the readme.

imerelli commented 11 months ago

Hi, thank you for your help. I certainly made some progress. I solved some issues following your suggestions, but now there is something beyond my ability. I'm pasting the code here (maybe you can paste it in your readme) along with the result of running it:

$ cat learn.py 
from deeprank.learn import *
from deeprank.learn.model3d import cnn_reg
from deeprank.models.metrics import OutputExporter, TensorboardBinaryClassificationExporter
import torch.optim as optim
import numpy as np

# preprocessed input data
hdf5_path = 'train_data.hdf5'

# output directory
out = './my_deeplearning_train/'

# declare the dataset instance
data_set = DataSet(
    hdf5_path,
    grid_info={
        'number_of_points': (10, 10, 10),
        'resolution': (10, 10, 10)
    },
    select_feature='all',
    select_target='class',
)

# create the network
model = NeuralNet(data_set,cnn_reg,model_type='3d',task='reg',
                          cuda=False,metrics_exporters=[OutputExporter(out), 
                          TensorboardBinaryClassificationExporter(out)])

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                                    lr=0.001,
                                    momentum=0.9,
                                    weight_decay=0.005)

# start the training, this will generate a model file named `best_valid_model.pth.tar`.
model.train(nepoch=50, divide_trainset=0.8, train_batch_size=5, num_workers=1)

$ python learn.py 

========================================
=        DeepRank Data Set
=
=        Training data
=        -> train_data.hdf5
=
=
=
========================================

   Checking dataset Integrity

   Processing data set:
   Train dataset
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547 from train_data.hdf5

   Data Set Info:
   Augmentation       : 10 rotations
   Training set       : 132 conformations
   Validation set     : 0 conformations
   Test set           : 0 conformations
   Number of channels : 31
   Grid Size          : 30, 30, 30
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv3d-1        [-1, 4, 29, 29, 29]             996
         MaxPool3d-2        [-1, 4, 14, 14, 14]               0
            Conv3d-3        [-1, 5, 13, 13, 13]             165
         MaxPool3d-4           [-1, 5, 6, 6, 6]               0
            Linear-5                   [-1, 84]          90,804
            Linear-6                    [-1, 1]              85
================================================================
Total params: 92,050
Trainable params: 92,050
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 3.19
Forward/backward pass size (MB): 0.92
Params size (MB): 0.35
Estimated Total Size (MB): 4.46
----------------------------------------------------------------

========================================
=        Convolution Neural Network
=        model   : 3d
=        CNN      : cnn_reg
=        features : AtomicDensities_ind
=                    C
=                    N
=                    O
=                    S
=        features : Feature_ind
=                    accessibility
=                    charge
=                    coulomb
=                    pssm_ALA
=                    pssm_ARG
=                    pssm_ASN
=                    pssm_ASP
=                    pssm_CYS
=                    pssm_GLN
=                    pssm_GLU
=                    pssm_GLY
=                    pssm_HIS
=                    pssm_ILE
=                    pssm_LEU
=                    pssm_LYS
=                    pssm_MET
=                    pssm_PHE
=                    pssm_PRO
=                    pssm_SER
=                    pssm_THR
=                    pssm_TRP
=                    pssm_TYR
=                    pssm_VAL
=                    residue_information_content
=                    variant_probability
=                    vdwaals
=                    wild_type_probability
=        targets  : class
=        CUDA     : False
========================================

: Batch Size: 5
: 106 confs. for training
: 26 confs. for validation
: 0 confs. for testing
running epoch 0 on 21 batches
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r002 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r007 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r003 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r005 from train_data.hdf5
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991_r002 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r004 from train_data.hdf5
                -> mini-batch: 1 
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r005 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r005 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r009 from train_data.hdf5
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r005 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r006 from train_data.hdf5
                -> mini-batch: 2 
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r001 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r007 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r008 from train_data.hdf5
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991 from train_data.hdf5
                -> mini-batch: 3 
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r006 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r006 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r010 from train_data.hdf5
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r005 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r001 from train_data.hdf5
                -> mini-batch: 4 
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r006 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r004 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r010 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r007 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r007 from train_data.hdf5
                -> mini-batch: 5 
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r010 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r010 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r010 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r002 from train_data.hdf5
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r001 from train_data.hdf5
                -> mini-batch: 6 
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r004 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r010 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r002 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r008 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659 from train_data.hdf5
                -> mini-batch: 7 
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991_r004 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r008 from train_data.hdf5
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991_r008 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r009 from train_data.hdf5
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r005 from train_data.hdf5
                -> mini-batch: 8 
loading variant 1CR4:A:259:Isoleucine->Alanine-m5958799002590729623_r009 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r009 from train_data.hdf5
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r004 from train_data.hdf5
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r002 from train_data.hdf5
                -> mini-batch: 9 
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r004 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r002 from train_data.hdf5
loading variant 1CR4:A:259:Isoleucine->Alanine-m5958799002590729623_r004 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r001 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162 from train_data.hdf5
                -> mini-batch: 10 
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r001 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r003 from train_data.hdf5
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r007 from train_data.hdf5
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r007 from train_data.hdf5
loading variant 1CR4:A:259:Isoleucine->Alanine-m5958799002590729623_r001 from train_data.hdf5
                -> mini-batch: 11 
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r004 from train_data.hdf5
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r008 from train_data.hdf5
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991_r009 from train_data.hdf5
loading variant 1CR4:A:259:Isoleucine->Alanine-m5958799002590729623_r006 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r005 from train_data.hdf5
                -> mini-batch: 12 
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r008 from train_data.hdf5
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991_r003 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r008 from train_data.hdf5
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r001 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r006 from train_data.hdf5
                -> mini-batch: 13 
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r008 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268 from train_data.hdf5
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r006 from train_data.hdf5
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991_r005 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r006 from train_data.hdf5
                -> mini-batch: 14 
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r009 from train_data.hdf5
loading variant 1CR4:A:281:Histidine->Alanine-4426686988381237153_r001 from train_data.hdf5
loading variant 1CR4:A:113:Histidine->Alanine-7893153230422533547_r008 from train_data.hdf5
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r003 from train_data.hdf5
loading variant 1CR4:A:281:Histidine->Alanine-4426686988381237153_r008 from train_data.hdf5
                -> mini-batch: 15 
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r003 from train_data.hdf5
loading variant 1CR4:A:288:Glutamate->Alanine-m9090782543475449484_r009 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r009 from train_data.hdf5
loading variant 1CR4:A:200:Glutamine->Alanine-m3000895191136890148_r003 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r001 from train_data.hdf5
                -> mini-batch: 16 
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r003 from train_data.hdf5
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r006 from train_data.hdf5
loading variant 1CR4:A:284:Isoleucine->Alanine-3112033218337307991_r007 from train_data.hdf5
loading variant 1CR4:A:203:Histidine->Alanine-m5294706904669780162_r010 from train_data.hdf5
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r002 from train_data.hdf5
                -> mini-batch: 17 
loading variant 1CR4:A:281:Histidine->Alanine-4426686988381237153_r005 from train_data.hdf5
loading variant 1CR4:A:259:Isoleucine->Alanine-m5958799002590729623 from train_data.hdf5
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r009 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r006 from train_data.hdf5
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564_r009 from train_data.hdf5
                -> mini-batch: 18 
loading variant 1CR4:A:255:Tyrosine->Alanine-4015625884953984564 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r005 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r007 from train_data.hdf5
loading variant 1CR4:A:262:Aspartate->Asparagine-8570449461295847659_r003 from train_data.hdf5
loading variant 1CR4:A:281:Histidine->Alanine-4426686988381237153_r006 from train_data.hdf5
                -> mini-batch: 19 
loading variant 1CR4:A:259:Isoleucine->Alanine-m5958799002590729623_r008 from train_data.hdf5
loading variant 1CR4:A:171:Aspartate->Asparagine-7915049557026475700_r002 from train_data.hdf5
loading variant 1CR4:A:281:Histidine->Alanine-4426686988381237153 from train_data.hdf5
loading variant 1CR4:A:281:Histidine->Alanine-4426686988381237153_r003 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487_r003 from train_data.hdf5
                -> mini-batch: 20 
loading variant 1CR4:A:175:Alanine->Phenylalanine-3840484606403438268_r007 from train_data.hdf5
loading variant 1CR4:A:281:Histidine->Alanine-4426686988381237153_r007 from train_data.hdf5
loading variant 1CR4:A:259:Isoleucine->Alanine-m5958799002590729623_r002 from train_data.hdf5
loading variant 1CR4:A:187:Aspartate->Alanine-m6358192257465413487 from train_data.hdf5
                -> mini-batch: 21 
Traceback (most recent call last):
  File "learn.py", line 36, in <module>
    model.train(nepoch=50, divide_trainset=0.8, train_batch_size=5, num_workers=1)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/NeuralNet.py", line 321, in train
    self._train(index_train, index_valid, index_test,
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/NeuralNet.py", line 575, in _train
    self._epoch(0, "training", train_loader, False)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/NeuralNet.py", line 722, in _epoch
    self._metrics_output.process(pass_name, epoch_number, entry_names, output_values, target_values)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/models/metrics.py", line 49, in process
    metrics_exporter.process(pass_name, epoch_number, entry_names, output_values, target_values)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/models/metrics.py", line 76, in process
    probability = output_values[entry_index][1]
IndexError: list index out of range
cbaakman commented 11 months ago

It seems like you are combining regression (task='reg') with TensorboardBinaryClassificationExporter. This is not compatible. Either make it task='class', or remove that exporter from the list.
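To make the mismatch concrete, here is a small sketch that only mimics the indexing shown in the traceback (`output_values[entry_index][1]`). The function name and the numbers are made up for illustration. Per the model summaries in this thread, the regression network ends in a single output value while the classification network ends in two.

```python
def positive_class_probability(output_row):
    """Mimic the exporter's indexing: take the value at index 1 of one output row."""
    return output_row[1]


classification_row = [0.3, 0.7]  # two values, one per class
regression_row = [0.42]          # a single regressed value

print(positive_class_probability(classification_row))

try:
    positive_class_probability(regression_row)
except IndexError as error:
    # Same failure mode as in the traceback: a one-element row has no index 1.
    print("regression output has no index 1:", error)
```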

rgayatri commented 11 months ago

As I mentioned earlier, please do not use a regression network for a classification task. Import the classification network instead: from deeprank.learn.model3d import cnn_class

Unfortunately, this package isn't designed for plug-and-play use. We do not intend to make it that way; instead, we have made it modular. The current repo will soon be archived as DeepRank2 gets finalized and released.

I do agree that adding an example workflow would be an improvement. Thanks for the input.

imerelli commented 11 months ago

Thank you. It worked. Now I will create a new database with the unseen variants to make predictions using the last script. Then I will upload everything here in case you need it. A very naive question in the meantime: is it possible to also get (maybe download from the database?) the PDB structures of the mutants, to perform further analysis?

cbaakman commented 11 months ago

PDB structures are downloadable from the wwpdb. Instructions are here: http://www.wwpdb.org/ftp/pdb-ftp-sites

Not sure if that's what you mean.

imerelli commented 11 months ago

Hi, I was able to create the database unseen_data.hdf5 with the variants to model. Now I am running the prediction step, but I have this problem. Can you help me?

$ python learn2.py

========================================
=        DeepRank Data Set
=
=        Training data
=        -> unseen_data.hdf5
=
=
=
========================================

   Checking dataset Integrity

   Processing data set:
   Train dataset
loading variant 1CR4:A:171:Phenylalanine->Alanine-5230317638147088145 from unseen_data.hdf5
Traceback (most recent call last):
  File "learn2.py", line 14, in <module>
    data_set = DataSet(
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 178, in __init__
    self.process_dataset()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 252, in process_dataset
    self.get_input_shape()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 660, in get_input_shape
    feature, _ = self.load_one_variant(fname)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 931, in load_one_variant
    target = variant_data.get('targets/' + self.select_target)[()]
TypeError: 'NoneType' object is not subscriptable
(deeprank) [imerelli@slurmlogin DeepRank-Mut]$ cat learn2.py
from deeprank.learn import *
from deeprank.learn.model3d import cnn_class
from deeprank.models.metrics import OutputExporter, TensorboardBinaryClassificationExporter
import torch.optim as optim
import numpy as np

# preprocessed input data
hdf5_path = 'unseen_data.hdf5'

# output directory
out = './my_deeplearning_train/'

# declare the dataset instance
data_set = DataSet(
    hdf5_path,
    grid_info={
        'number_of_points': (10, 10, 10),
        'resolution': (10, 10, 10)
    },
    select_feature='all',
    select_target='class',
)

# create the network
model = NeuralNet(data_set,cnn_class,model_type='3d',task='class',pretrained_model="best_valid_model.pth.tar",
                          cuda=False,metrics_exporters=[OutputExporter(out), TensorboardBinaryClassificationExporter(out)])

#model = NeuralNet(data_set,cnn_reg,model_type='3d',task='reg',
#                                  cuda=False,metrics_exporters=[OutputExporter(out)])

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                                    lr=0.001,
                                    momentum=0.9,
                                    weight_decay=0.005)

# start the training, this will generate a model file named `best_valid_model.pth.tar`.
#model.train(nepoch=50, divide_trainset=0.8, train_batch_size=5, num_workers=1)
model.test()
imerelli commented 11 months ago

Concerning the PDB structures, I was wondering if I can download from DeepRank the pdb coordinates of the modelled proteins with the variants, for example to perform docking experiments after the prediction of their CLASS (BENIGN or PATHOGENIC).

cbaakman commented 11 months ago

I made a recent push to allow the model to run on unlabeled data. Unfortunately, deeprank does not generate structures. But the pymol mutagenesis feature might help you generate a structure of a variant.

imerelli commented 11 months ago

Ok, thank you. I pulled the latest version from GitHub, but I still get errors. If useful, I can send you the files I'm using (generate2.py, learn2.py, table2.py).

$ git pull
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 28 (delta 16), reused 24 (delta 14), pack-reused 0
Unpacking objects: 100% (28/28), done.
From https://github.com/DeepRank/DeepRank-Mut
   6c6b072..edfed88  main       -> origin/main
Updating 6c6b072..edfed88
Fast-forward
 README.md                           |  4 ++--
 deeprank/learn/DataSet.py           | 42 ++++++++++++++++++++++++++++--------------
 deeprank/learn/NeuralNet.py         | 28 +++++++++++++++++++---------
 test/data/pdb/1CR4/1CR4.pdb         |  1 -
 test/generate/test_datagenerator.py | 23 +++++++++++++++++++++++
 test/operate/test_pdb.py            |  3 +++
 test/test_learn.py                  | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 144 insertions(+), 26 deletions(-)

$ python learn2.py

========================================
=        DeepRank Data Set
=
=        Training data
=        -> unseen_data.hdf5
=
=
=
========================================

   Checking dataset Integrity

   Processing data set:
   Train dataset
loading variant 1CR4:A:171:Phenylalanine->Alanine-5230317638147088145 from unseen_data.hdf5
Traceback (most recent call last):
  File "learn2.py", line 14, in <module>
    data_set = DataSet(
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 178, in __init__
    self.process_dataset()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 252, in process_dataset
    self.get_input_shape()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 662, in get_input_shape
    feature, _ = self.load_one_variant(fname)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 939, in load_one_variant
    target_group = variant_data['targets']
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/h5py/_hl/group.py", line 357, in __getitem__                                                                                                         
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 189, in h5py.h5o.open
KeyError: "Unable to synchronously open object (object 'targets' doesn't exist)"
cbaakman commented 11 months ago

Whoops! Looks like your output is slightly different from mine. Doesn't matter! I've pushed another patch that allows deeprank to handle even datasets without object 'targets'.
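For readers following along, the two failures above (`.get(...)` returning None, then a missing 'targets' object) can be sketched as follows. This is not the actual patch: a plain dict stands in for the HDF5 group, and `read_target` is a hypothetical helper written only to show the guard.

```python
def read_target(variant_data, select_target):
    """Return the target value for one variant, or None if it is unlabeled.

    `variant_data` is a plain dict standing in for an HDF5 group, so the
    h5py-style `[()]` dereference from the traceback is left out here.
    """
    if select_target is None:
        return None
    # dict.get() returns None instead of raising when the key is absent,
    # so the caller must check for None before using the value.
    return variant_data.get("targets/" + select_target)


labeled = {"targets/class": 1}
unlabeled = {}

print(read_target(labeled, "class"))
print(read_target(unlabeled, "class"))
```

Code that consumes the result then has to treat None as "no label available" rather than indexing into it.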

imerelli commented 11 months ago

Sorry, still not working. I'm attaching the files to reproduce the analysis. Files with the 2 suffix are for the inference part, while the others are for the learning part. archivio.tar.gz

$ python learn2.py 

========================================
=        DeepRank Data Set
=
=        Training data
=        -> unseen_data.hdf5
=
=
=
========================================

   Checking dataset Integrity

   Processing data set:
   Train dataset
loading variant 1CR4:A:171:Phenylalanine->Alanine-5230317638147088145 from unseen_data.hdf5
      Computing norm for unseen_data.hdf5
Traceback (most recent call last):
  File "learn2.py", line 14, in <module>
    data_set = DataSet(
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 178, in __init__
    self.process_dataset()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 256, in process_dataset
    self.get_norm()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 720, in get_norm
    self._read_norm()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/DataSet.py", line 747, in _read_norm
    norm.get()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/NormalizeData.py", line 43, in get
    self._extract_data()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/generate/NormalizeData.py", line 134, in _extract_data
    for tname, tval in target_group.items():
AttributeError: 'NoneType' object has no attribute 'items'

$ cat learn2.py 
from deeprank.learn import *
from deeprank.learn.model3d import cnn_class
from deeprank.models.metrics import OutputExporter, TensorboardBinaryClassificationExporter
import torch.optim as optim
import numpy as np

# preprocessed input data
hdf5_path = 'unseen_data.hdf5'

# output directory
out = './my_deeplearning_train/'

# declare the dataset instance
data_set = DataSet(
    hdf5_path,
    grid_info={
        'number_of_points': (10, 10, 10),
        'resolution': (10, 10, 10)
    },
    select_feature='all',
    #select_target='class',
)

# create the network
model = NeuralNet(data_set,cnn_class,model_type='3d',task='class',pretrained_model="best_valid_model.pth.tar",
                          cuda=False,metrics_exporters=[OutputExporter(out), TensorboardBinaryClassificationExporter(out)])

#model = NeuralNet(data_set,cnn_reg,model_type='3d',task='reg',
#                                  cuda=False,metrics_exporters=[OutputExporter(out)])

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                            lr=0.001,
                            momentum=0.9,
                            weight_decay=0.005)

# start the training, this will generate a model file named `best_valid_model.pth.tar`.
#model.train(nepoch=50, divide_trainset=0.8, train_batch_size=5, num_workers=1)
model.test()
cbaakman commented 11 months ago

So this error happened during normalization. Sorry for not testing this. It should be fixed now.

imerelli commented 11 months ago

I see the computation get a little further, but now I have this error:

$ python learn2.py

========================================
=        DeepRank Data Set
=
=        Training data
=        -> unseen_data.hdf5
=
=
=
========================================

   Checking dataset Integrity

   Processing data set:
   Train dataset
loading variant 1CR4:A:171:Phenylalanine->Alanine-5230317638147088145 from unseen_data.hdf5
      Computing norm for unseen_data.hdf5

   Data Set Info:
   Augmentation       : 10 rotations
   Training set       : 418 conformations
   Validation set     : 0 conformations
   Test set           : 0 conformations
   Number of channels : 31
   Grid Size          : 30, 30, 30

========================================
=        DeepRank Data Set
=
=        Training data
=        -> unseen_data.hdf5
=
=
=
========================================

   Checking dataset Integrity

   Processing data set:
   Train dataset
loading variant 1CR4:A:171:Phenylalanine->Alanine-5230317638147088145 from unseen_data.hdf5

   Data Set Info:
   Augmentation       : 10 rotations
   Training set       : 418 conformations
   Validation set     : 0 conformations
   Test set           : 0 conformations
   Number of channels : 31
   Grid Size          : 30, 30, 30
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
       BatchNorm3d-1       [-1, 31, 30, 30, 30]              62
            Conv3d-2       [-1, 31, 29, 29, 29]           7,719
       BatchNorm3d-3       [-1, 31, 29, 29, 29]              62
              ReLU-4       [-1, 31, 29, 29, 29]               0
            Conv3d-5       [-1, 64, 28, 28, 28]          15,936
       BatchNorm3d-6       [-1, 64, 28, 28, 28]             128
         MaxPool3d-7       [-1, 64, 14, 14, 14]               0
              ReLU-8       [-1, 64, 14, 14, 14]               0
            Conv3d-9       [-1, 64, 12, 12, 12]         110,656
      BatchNorm3d-10       [-1, 64, 12, 12, 12]             128
             ReLU-11       [-1, 64, 12, 12, 12]               0
          Flatten-12               [-1, 110592]               0
      BatchNorm1d-13               [-1, 110592]         221,184
           Linear-14                  [-1, 100]      11,059,300
             ReLU-15                  [-1, 100]               0
          Dropout-16                  [-1, 100]               0
           Linear-17                  [-1, 100]          10,100
             ReLU-18                  [-1, 100]               0
          Dropout-19                  [-1, 100]               0
           Linear-20                    [-1, 2]             202
================================================================
Total params: 11,425,477
Trainable params: 11,425,477
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 3.19
Forward/backward pass size (MB): 52.03
Params size (MB): 43.58
Estimated Total Size (MB): 98.81
----------------------------------------------------------------
Traceback (most recent call last):
  File "learn2.py", line 26, in <module>
    model = NeuralNet(data_set,cnn_class,model_type='3d',task='class',pretrained_model="best_valid_model.pth.tar",
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/NeuralNet.py", line 235, in __init__
    self.load_optimizer_params()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/NeuralNet.py", line 410, in load_optimizer_params
    self.optimizer.load_state_dict(self.state['optimizer'])
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/torch/optim/optimizer.py", line 770, in load_state_dict
    self.__setstate__({'state': state, 'param_groups': param_groups})
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/torch/optim/adamw.py", line 84, in __setstate__
    state_values[0]["step"]
KeyError: 'step'
cbaakman commented 11 months ago

Looks like torch has trouble interpreting your pretrained model. Maybe it works to retrain it. Otherwise, could you upload it here?

imerelli commented 11 months ago

I retrained the model, but I get the same error. Here is a link to the pretrained model; it's too big for GitHub: https://www.dropbox.com/scl/fi/9t7soyhercnd565tcxa02/best_valid_model.pth.tar?rlkey=6grljyls1b7lv60xeq9kilg1q&dl=0

cbaakman commented 11 months ago

It turned out that there was a bug in loading the optimizer settings from the pretrained model. But you don't even need an optimizer in step 3, so I made it optional in my last push.

imerelli commented 11 months ago

Okay, the computation of learn2.py ran smoothly. But now, where can I find the results? I mean, where are the predictions about whether my variants are benign or pathogenic?

cbaakman commented 11 months ago

Sorry, I forgot about that.

I pushed a fix. You'll need to pull, use the OutputExporter in your learn2.py script and run it again.

Output will go to a file in the output directory you set for it.

imerelli commented 11 months ago

Sorry, but after the last pull I get this error:

loading variant 1CR4:A:173:Phenylalanine->Isoleucine-m5953980920123066196_r001 from unseen_data.hdf5
                -> mini-batch: 418 
Traceback (most recent call last):
  File "learn2.py", line 40, in <module>
    model.test()
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/NeuralNet.py", line 373, in test
    self._epoch(0, "test", loader, False)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/learn/NeuralNet.py", line 742, in _epoch
    self._metrics_output.process(pass_name, epoch_number, entry_names, output_values, target_values)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/models/metrics.py", line 49, in process
    metrics_exporter.process(pass_name, epoch_number, entry_names, output_values, target_values)
  File "/opt/tools/deg/DeepRank-Mut/deeprank/models/metrics.py", line 70, in process
    loss = cross_entropy(tensor(output_values), tensor(target_values))
  File "/opt/tools/deg/miniforge3/envs/deeprank/lib/python3.8/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
IndexError: Target -1 is out of bounds.

Here is my script:

$ cat learn2.py 
from deeprank.learn import *
from deeprank.learn.model3d import cnn_class
from deeprank.models.metrics import OutputExporter, TensorboardBinaryClassificationExporter
import torch.optim as optim
import numpy as np

# preprocessed input data
hdf5_path = 'unseen_data.hdf5'

# output directory
out = './my_deeplearning_train/'

# declare the dataset instance
data_set = DataSet(
    hdf5_path,
    grid_info={
        'number_of_points': (10, 10, 10),
        'resolution': (10, 10, 10)
    },
    select_feature='all',
    #select_target='class',
)

# create the network
model = NeuralNet(data_set, cnn_class, model_type='3d', task='class',
                  pretrained_model="best_valid_model.pth.tar",
                  cuda=False,
                  metrics_exporters=[OutputExporter(out), TensorboardBinaryClassificationExporter(out)])

#model = NeuralNet(data_set, cnn_reg, model_type='3d', task='reg',
#                  cuda=False, metrics_exporters=[OutputExporter(out)])

# change the optimizer (optional)
#model.optimizer = optim.SGD(model.net.parameters(),
#                            lr=0.001,
#                            momentum=0.9,
#                            weight_decay=0.005)

# start the training, this will generate a model file named `best_valid_model.pth.tar`.
#model.train(nepoch=50, divide_trainset=0.8, train_batch_size=5, num_workers=1)
model.test()

cbaakman commented 11 months ago

Remove the TensorboardBinaryClassificationExporter from your script. It only works if you have a binary target value.
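
The IndexError in the traceback comes from cross-entropy receiving a target of -1, while class-index targets must lie between 0 and the number of classes minus one. As a rough sketch, assuming -1 marks an entry without a known class (an assumption, not confirmed in this thread), a metric could separate such entries before computing a loss. The function name split_labelled is hypothetical and not part of DeepRank-Mut.

```python
def split_labelled(entry_names, target_values, n_classes=2):
    """Separate entries with a valid class index from entries without one.

    Hypothetical helper: a target counts as valid only if it lies in
    [0, n_classes), which is what a cross-entropy loss over class
    indices expects.
    """
    labelled, unlabelled = [], []
    for name, target in zip(entry_names, target_values):
        if 0 <= target < n_classes:
            labelled.append((name, target))
        else:
            # e.g. -1 for unseen variants with no known class (assumed convention)
            unlabelled.append(name)
    return labelled, unlabelled
```

Only the labelled pairs would then go into a loss or a binary classification metric; the unlabelled names can still be written out with their predictions.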