Zenodo PPI zip file and PPI extraction scripts

rakeshr10 commented 4 months ago

Hi @anton-bushuiev,

The Zenodo file ppi_6A.zip doesn't seem to be in '.zip' format. I am unable to extract it.

Also, the download.sh and ppi_extractor.py scripts for PPI extraction don't work. The PDB rsync link does not work for downloading, and many of the threads in ppi_extractor.py raise errors after running for some time.

Would be great if you could fix this.

Regards Rakesh

martinpacesa commented 4 months ago

I am having the same issue!

anton-bushuiev commented 4 months ago

Hi, @rakeshr10 and @martinpacesa,

Thank you for reporting the issue!

The ppi_6A.zip archive on Zenodo was indeed corrupted. I have uploaded a new, fixed file to Zenodo, and apologize for the difficulties. I have also improved the description of how to use and download the dataset in README.md (see the new "How to use" section). Now, downloading from Zenodo can be done automatically.
In the download.sh script, you may need to adjust the domain name (rsync.ebi.ac.uk by default) depending on your location (see the corresponding PDB docs). I have extended the corresponding comment in the script. For example, downloading from the US server does not work for me because I am located in Europe. That is why the UK server, suitable for Europe, is used by default in the script. Please let me know if it is not the source of the issue.
Could you please provide more details on the error caused by ppi_extractor.py? Is it a system error or a logical error in the script? If it is a system error, did you try reducing the number of processes (--max_workers argument)? If it is a logical error, could you please copy-paste the error trace here?

Please let me know if there are any other issues or unclarities.

Best, Anton

rakeshr10 commented 4 months ago

Hi @anton-bushuiev ,

Thanks for looking into this.

1) The zip file now seems to be fine when I download it directly from Zenodo. But when I download it using download_from_zenodo, the download still does not seem to complete and the file is corrupted. I also noticed that I have to be in the root directory of ppiref for it to download into the data directory of ppiref.

2) For me, the PDB URLs listed in ColabFold work fine: https://github.com/sokrypton/ColabFold/blob/main/setup_databases.sh

3) This was the error for ppi_extractor.py. I had dr_sasa built in the external directory of ppiref.

Collecting input files: 100%|██████████| 210964/210964 [00:00<00:00, 2627998.66it/s]
Collecting processed files: 0it [00:00, ?it/s]
  1%| | 1273/210964 [12:58<30:39:45, 1.90it/s]
[PosixPath('/data/results/openproteinset/PPIDatabase/data/pdb/divided/a0/1a07.pdb')] generated an exception: [Errno 2] No such file or directory: '/usr/local/envs/colabfold/lib/python3.9/site-packages/ppiref-1.0-py3.9.egg/external/dr_sasa_n/build/dr_sasa'
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/envs/colabfold/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/ppiref-1.0-py3.9.egg/ppiref/extraction.py", line 245, in _extract_chunk_paths
    self.extract(pdb_path)
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/ppiref-1.0-py3.9.egg/ppiref/extraction.py", line 140, in extract
    buried_residues, bsa = self.dr_sasa(pdb_path, ipartners)
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/ppiref-1.0-py3.9.egg/ppiref/surface.py", line 75, in __call__
    subprocess.run(command, cwd=self.tmp_dir, check=True, capture_output=not self.verbose)
  File "/usr/local/envs/colabfold/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/local/envs/colabfold/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/envs/colabfold/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/envs/colabfold/lib/python3.9/site-packages/ppiref-1.0-py3.9.egg/external/dr_sasa_n/build/dr_sasa'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/envs/colabfold/lib/python3.9/site-packages/ppiref-1.0-py3.9.egg/ppiref/extraction.py", line 234, in extract_parallel
    future.result()
  File "/usr/local/envs/colabfold/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/local/envs/colabfold/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/envs/colabfold/lib/python3.9/site-packages/ppiref-1.0-py3.9.egg/external/dr_sasa_n/build/dr_sasa'

Could you also add to the description of ppi_6A how the files were generated? I noticed that some of the PDB files have very small interfaces, and I am curious whether BSA was used to filter them.

Regards Rakesh

anton-bushuiev commented 4 months ago

Hi, @rakeshr10,

Thank you for your comments.

> The zip file now seems to be fine, when I download directly from zenodo. But when I download using download_from_zenodo, it still seems to not complete the download and is corrupted, also I noticed I have to be in the root dir of ppiref for it to download in the data directory of ppiref.

I have just followed the steps from the README on a different machine and they worked fine. Please make sure you follow the same steps. I ran the following commands and got all the PPIs downloaded and unpacked in PPIRef/ppiref/data/ppiref/ppi_6A, regardless of my location in the file system:

# Create env
conda create -n ppiref python=3.10
conda activate ppiref

# Clone and install ppiref
git clone https://github.com/anton-bushuiev/PPIRef.git
cd PPIRef; pip install -e .

# Change directory to some random location
cd tests

# Download and unpack (runs the Python call directly from the shell)
python -c "from ppiref.utils.misc import download_from_zenodo; download_from_zenodo('ppi_6A.zip')"

# Verify
find ../ppiref/data/ppiref/ppi_6A -name "*.pdb*" | wc -l
> 745501

> But when I download using download_from_zenodo, it still seems to not complete the download and is corrupted

Could you please specify what you mean by corrupted?
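One way to make "corrupted" concrete is to test the downloaded archive with Python's standard zipfile module. The following is only a minimal sketch: the helper name check_zip is hypothetical and not part of ppiref, and the path you pass in depends on where the archive ended up on your machine.

```python
import zipfile
from pathlib import Path


def check_zip(path):
    """Return a short diagnosis of the archive at `path` (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    if not zipfile.is_zipfile(path):
        # No valid end-of-central-directory record, which is typical
        # of a truncated or interrupted download.
        return "not a zip file"
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the name of the first
        # one with a bad CRC or header, or None if all members are fine.
        bad_member = archive.testzip()
    return "ok" if bad_member is None else f"bad member: {bad_member}"
```

Reporting the returned string together with the file size on disk would help tell an incomplete download apart from an archive that is damaged internally.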

> also I noticed I have to be in the root dir of ppiref for it to download in the data directory of ppiref.

That should not be the case. Please note that the destination_folder variable in download_from_zenodo is independent of your current location in the file system.

> This was the error for ppi_extractor.py. I had dr_sasa build in external directory of ppiref.

Having dr_sasa built in the external directory is also fine, but then you need to override the DR_SASA_PATH variable in ppiref/definitions.py so that it points to your binary.
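The FileNotFoundError in the trace above is raised by subprocess.run because the dr_sasa binary is not at the configured path. A quick check before starting a long extraction run could look like the sketch below. The helper resolve_executable is hypothetical and not part of ppiref; it only uses the standard library.

```python
import os
import shutil
from pathlib import Path


def resolve_executable(candidate):
    """Return the absolute path of a runnable executable, or None (hypothetical helper)."""
    path = Path(candidate).expanduser()
    if path.is_file() and os.access(path, os.X_OK):
        return str(path.resolve())
    # Fall back to a PATH lookup for bare command names such as "dr_sasa".
    return shutil.which(str(candidate))
```

Assuming the variable can be imported with `from ppiref.definitions import DR_SASA_PATH`, calling resolve_executable(DR_SASA_PATH) and getting None back would indicate that the path still needs to be overridden.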

> Can you also put in the description of ppi_6A how the files were generated? I noticed some of the PDB files had very small interfaces, I am curious if BSA was used to filter these.

I have added this information to the README. The ppi_6A.zip archive contains the PPIs that have at least one contact within 6 Å between heavy atoms. If you are interested only in the filtered PPIs, you can find the corresponding IDs in ppiref_6A_filtered.json. Please also note that all the .pdb files in ppi_6A contain headers with statistics, including BSA.
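Selecting only the filtered PPIs from the unpacked folder could be sketched as follows. This rests on two assumptions that are not confirmed in this thread: that ppiref_6A_filtered.json holds a flat list of ID strings, and that each file is named "<id>.pdb". The function select_filtered_files is hypothetical and not part of ppiref.

```python
import json
from pathlib import Path


def select_filtered_files(ppi_dir, ids_json):
    """Return the .pdb files under `ppi_dir` whose stem is listed in `ids_json`."""
    # Assumption: the JSON file holds a flat list of PPI ID strings.
    wanted = set(json.loads(Path(ids_json).read_text()))
    # Assumption: each file is named "<id>.pdb"; rglob also searches subfolders.
    return sorted(p for p in Path(ppi_dir).rglob("*.pdb") if p.stem in wanted)
```

If the JSON layout or the file naming differs, only the two commented lines need adjusting.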

Let me know if the comments above do not solve your issue.

Best, Anton

anton-bushuiev commented 2 months ago

Closing due to inactivity. Should be fixed now, please reopen if not.
