BioinfoMachineLearning / DIPS-Plus

The Enhanced Database of Interacting Protein Structures for Interface Prediction
https://zenodo.org/record/5134732
GNU General Public License v3.0

about Process the raw PDB data into associated pair files #12

Closed lijiashan2020 closed 2 years ago

lijiashan2020 commented 2 years ago


I have a similar problem. When I run the command

python3 project/datasets/builder/make_dataset.py project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim --num_cpus 28 --source_type rcsb --bound

It seems to run successfully and generates a .pkl file for every PDB entry; however, it then stops making progress and reports no error. This command seems to be expected to generate project/datasets/DIPS/interim/pairs, which is used by the next command. When I run that command,

python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pairs project/datasets/DIPS/filters project/datasets/DIPS/interim/pairs-pruned --num_cpus 28

the result is as follows:

Using backend: pytorch
Usage: prune_pairs.py [OPTIONS] PAIR_DIR TO_KEEP_DIR OUTPUT_DIR
Try "prune_pairs.py --help" for help.
Error: Invalid value for "PAIR_DIR": Path "project/datasets/DIPS/interim/pairs" does not exist.

What can I do about this?

Thanks

amorehead commented 2 years ago

Hi, @lijiashan2020. My first question is, are you able to confirm that the directory the prune_pairs.py script is referring to is in fact already created and populated with .pkl files from running make_dataset.py?
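One way to answer this question is to count the .pkl files under each output directory. The snippet below is a minimal sketch, not part of DIPS-Plus: `count_pickles` is a hypothetical helper, and it assumes the outputs are written as `*.pkl` files somewhere beneath the directory being checked.

```python
from pathlib import Path


def count_pickles(directory):
    """Count .pkl files anywhere under `directory`.

    Returns -1 when the directory does not exist, which is the situation
    prune_pairs.py complained about for its PAIR_DIR argument.
    """
    root = Path(directory)
    if not root.is_dir():
        return -1
    # rglob walks subdirectories too, in case outputs are nested.
    return sum(1 for _ in root.rglob("*.pkl"))
```

For example, `count_pickles("project/datasets/DIPS/interim/pairs")` returning -1 would mean the pairs directory was never created, while 0 would mean it exists but is empty.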

lijiashan2020 commented 2 years ago


Thank you for your reply! I reconfirmed that the directory has been created and populated with .pkl files; however, the deadlock still happens. Borrowing another questioner's solution, I split up the processing done by the make_dataset.py script: I first created six different folders, moved the files into them, and ran make_dataset.py on each folder separately. Some new files were generated in the 'pairs' and 'complexes' folders that did not exist before, but then a new error appeared, as follows:

multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/parallel.py", line 85, in submit_helper
    raise e
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/parallel.py", line 79, in submit_helper
    return function(*inputs)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/atom3/pair.py", line 98, in complex_to_pairs
    pairs, num_subunits = get_pairs(complex)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/atom3/pair.py", line 141, in get_pairs_param
    return get_pairs(neighbor_def, complex, type, unbound, nb_fn, full)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/atom3/pair.py", line 156, in get_pairs
    _get_rcsb_pairs(neighbor_def, complex, unbound, nb_fn, full)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/atom3/pair.py", line 190, in _get_rcsb_pairs
    df = pd.read_pickle(pkl_filename)
  File "/home/jiashan/.local/lib/python3.9/site-packages/pandas/io/pickle.py", line 222, in read_pickle
    return pc.load(handles.handle, encoding=None)
  File "/home/jiashan/.local/lib/python3.9/site-packages/pandas/compat/pickle_compat.py", line 274, in load
    return up.load()
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/pickle.py", line 1210, in load
    dispatch[key[0]](self)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/pickle.py", line 1535, in load_stack_global
    self.append(self.find_class(module, name))
  File "/home/jiashan/.local/lib/python3.9/site-packages/pandas/compat/pickle_compat.py", line 206, in find_class
    return super().find_class(module, name)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/pickle.py", line 1579, in find_class
    return _getattribute(sys.modules[module], name)[0]
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/pickle.py", line 331, in _getattribute
    raise AttributeError("Can't get attribute {!r} on {!r}"
AttributeError: Can't get attribute '_unpickle_block' on <module 'pandas._libs.internals' from '/home/jiashan/.local/lib/python3.9/site-packages/pandas/_libs/internals.cpython-39-x86_64-linux-gnu.so'>
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/extendplus/jiashan/DIPS_plus/project/datasets/builder/make_dataset.py", line 54, in <module>
    main()
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/extendplus/jiashan/DIPS_plus/project/datasets/builder/make_dataset.py", line 47, in main
    pair.all_complex_to_pairs(complexes, source_type, get_pairs, pairs_dir, num_cpus)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/atom3/pair.py", line 82, in all_complex_to_pairs
    par.submit_jobs(complex_to_pairs, inputs, num_cpus)
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/parallel.py", line 60, in submit_jobs
    out = res.get()
  File "/extendplus/app/miniconda3/envs/py39/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
AttributeError: Can't get attribute '_unpickle_block' on <module 'pandas._libs.internals' from '/home/jiashan/.local/lib/python3.9/site-packages/pandas/_libs/internals.cpython-39-x86_64-linux-gnu.so'>

Could you please help me with this error? Thanks!
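For readers who want to reproduce the batching workaround described in this comment, the following is a rough sketch of one way to do it. `split_into_batches` and the `batch_<n>` folder names are hypothetical and not part of DIPS-Plus or atom3; the sketch simply moves the immediate children of a folder round-robin into a fixed number of batch folders, so that make_dataset.py can be pointed at each batch in turn.

```python
import shutil
from pathlib import Path


def split_into_batches(source_dir, dest_root, num_batches=6):
    """Move the entries of `source_dir` round-robin into batch folders.

    Creates dest_root/batch_0 ... dest_root/batch_{num_batches - 1} and
    returns the list of batch folders.
    """
    # Materialise the listing first so newly created folders are not included.
    entries = sorted(Path(source_dir).iterdir())
    batch_dirs = []
    for i in range(num_batches):
        batch = Path(dest_root) / f"batch_{i}"
        batch.mkdir(parents=True, exist_ok=True)
        batch_dirs.append(batch)
    for index, entry in enumerate(entries):
        target = batch_dirs[index % num_batches]
        shutil.move(str(entry), str(target / entry.name))
    return batch_dirs
```

Note that this moves files rather than copying them, so it is safest to try it on a copy of the raw data first.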

amorehead commented 2 years ago

@lijiashan2020,

I'm glad to hear you were able to progress through this issue further! I believe the error you are seeing comes from version incompatibility between Pandas and your Python version's Pickle module (each version of Python has its own version of Pickle, I believe). I would first check to make sure you are using a compatible version of Pandas for Python 3.8 (see this post for reference: https://stackoverflow.com/a/71090354). You may need to downgrade Pandas a few versions such that the Pickle module Pandas uses can correctly load the Pickle files Python is creating in make_dataset.py.

amorehead commented 2 years ago

@lijiashan2020,

In particular, I would recommend that you try to downgrade Pandas to version 1.2.4, to see if this fixes what I believe to be an incompatibility between Python 3.8's Pickle module and the latest version of Pandas.
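Before and after downgrading, it may help to confirm which Pandas installation Python actually imports. The traceback above shows Pandas being loaded from a per-user `.local` site-packages directory while the interpreter lives in a conda environment, so the version in use may not be the one installed in that environment. The snippet below is a small sketch under that assumption; `describe_pandas` and `version_tuple` are hypothetical helper names.

```python
def version_tuple(version):
    """Turn a version string such as '1.2.4' into (1, 2, 4).

    Only the leading digits of each of the first three fields are used,
    so '1.4.0rc1' becomes (1, 4, 0).
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def describe_pandas():
    """Return (version, file path) of the importable Pandas, or None."""
    try:
        import pandas
    except ImportError:
        return None
    return pandas.__version__, pandas.__file__
```

If `describe_pandas()` reports a path outside the active environment, or `version_tuple` of the reported version differs from `(1, 2, 4)`, the downgrade has probably not taken effect for the interpreter running make_dataset.py.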

lijiashan2020 commented 2 years ago

Thank you very much for your reply! I will reinstall the environment according to the module version you recommended, your open source work has benefited me a lot. Thank you again for your help!

amorehead commented 2 years ago

@lijiashan2020,

Let me know if this works for you. I am happy to help where I can!

lijiashan2020 commented 2 years ago

I'm very happy to tell you that after downgrading the Python version, the program no longer reports errors! But just when it is about to finish successfully, the program suddenly prints "Aborted!":

2022-04-06 22:38:28,877 INFO 86949: For complex 5pm8.pdb1 found 0 pairs out of 1 chains
2022-04-06 22:38:28,877 INFO 86949: Working on 5pmn.pdb1
2022-04-06 22:38:28,888 INFO 86949: For complex 5pmn.pdb1 found 0 pairs out of 1 chains
2022-04-06 22:38:28,888 INFO 86949: Working on 5pmj.pdb1
2022-04-06 22:38:28,894 INFO 86949: For complex 5pmj.pdb1 found 0 pairs out of 1 chains

Aborted!

I am very sorry to disturb you with so many problems, and I will also try my best to solve them myself. Thanks!

amorehead commented 2 years ago

@lijiashan2020,

This sounds like a core dump happened somewhere within your Python script's execution. I also noticed that your script is looking at processing pairs for single amino acid chains, which at least at first glance does not seem to make sense to me. Typically, as I recall, this script would be looking for at least one pair for each collection of chains (possibly two or more chains). Seeing it find only one chain in your "complexes" makes me suspect that some previous data processing did not complete successfully. I would recommend, with your new version of Pandas, rerunning the entire data processing pipeline (if possible) to ensure that the version of Pandas you used before did not result in unexpected (incorrect) processing of each RCSB protein complex. I hope this information helps.

lijiashan2020 commented 2 years ago

Thank you very much for your recent help! I found that I had not deleted the previously generated result files before rerunning, which made the rerun invalid even when I divided the data into several parts. It runs successfully now!
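A sketch of how stale outputs could be cleared before a rerun is shown below. It is an illustration only: `clear_outputs` is a hypothetical helper, and the assumption that the stale results live in the 'pairs' and 'complexes' folders under the interim directory comes from the folder names mentioned earlier in this thread, not from the DIPS-Plus documentation. It defaults to a dry run because deletion cannot be undone.

```python
import shutil
from pathlib import Path


def clear_outputs(interim_dir, subdirs=("pairs", "complexes"), dry_run=True):
    """List earlier output folders under `interim_dir` and optionally delete them.

    With dry_run=True (the default) nothing is removed; the folders that
    would be deleted are only returned.
    """
    found = []
    for name in subdirs:
        target = Path(interim_dir) / name
        if target.is_dir():
            found.append(target)
            if not dry_run:
                shutil.rmtree(target)
    return found
```

Calling `clear_outputs("project/datasets/DIPS/interim")` first shows what would be removed; passing `dry_run=False` then deletes those folders so that make_dataset.py starts from a clean state.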

amorehead commented 2 years ago

@lijiashan2020,

I am glad to hear it!