BenevolentAI / DeeplyTough

DeeplyTough: Learning Structural Comparison of Protein Binding Sites

Cannot retrieve some cluster files #20

Open L40S38 opened 2 years ago

L40S38 commented 2 years ago

Hi.

I ran the commands to evaluate on the Vertex and ProSPECCTS datasets, but hit almost the same error in both cases, shown below.

(I exported $STRUCTURE_DATA_DIR=$DEEPLYTOUGH/datasets_structure. Paths to the repository are shortened below.)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5324k  100 5324k    0     0  1471k      0  0:00:03  0:00:03 --:--:-- 1472k
INFO:datasets.vertex:Preprocessing: downloading data and extracting pockets, this will take time.
INFO:root:cluster file path: DeeplyTough/datasets_structure/bc-30.out
WARNING:root:Cluster definition not found, will download a fresh one.
WARNING:root:However, this will very likely lead to silent incompatibilities with any old 'pdbcode_mappings.pickle' files! Please better remove those manually.
Traceback (most recent call last):
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 68, in <module>
    main()
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 32, in main
    database.preprocess_once()
  File "DeeplyTough/deeplytough/datasets/vertex.py", line 49, in preprocess_once
    clusterer = RcsbPdbClusters(identity=30)
  File "DeeplyTough/deeplytough/misc/utils.py", line 248, in __init__
    self._fetch_cluster_file()
  File "DeeplyTough/deeplytough/misc/utils.py", line 262, in _fetch_cluster_file
    self._download_cluster_sets(cluster_file_path)
  File "DeeplyTough/deeplytough/misc/utils.py", line 253, in _download_cluster_sets
    request.urlretrieve(f'https://cdn.rcsb.org/resources/sequence/clusters/bc-{self.identity}.out', cluster_file_path)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Evaluation on the TOUGH-M1 dataset succeeded, so I'm afraid one of the URLs for the Vertex and ProSPECCTS data has expired. Would you mind checking?

JoshuaMeyers commented 2 years ago

Hey @L40S38, thanks for opening a ticket. It seems this is due to the RCSB PDB cluster files moving, see https://www.rcsb.org/news/feature/6205750d8f40f9265109d39f (in fact, the old clustering has been discontinued and replaced, so this may even have scientific implications for DeeplyTough).

I will look into it. If you don't need to use the cluster file (e.g. if you are happy with random splitting, or you just want to run the existing models), I believe you can simply specify a different splitting method.
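For readers who take that route: random splitting needs nothing from RCSB. A minimal sketch of what it could look like in plain Python (random_split is a hypothetical helper, not part of DeeplyTough's API, and the 80/20 ratio is just an example):

import random

def random_split(entries, test_fraction=0.2, seed=0):
    """Shuffle dataset entries deterministically and split them into train/test lists."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)  # seeded so the split is reproducible
    n_test = int(len(entries) * test_fraction)
    return entries[n_test:], entries[:n_test]

# train_entries, test_entries = random_split(pdb_codes)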

L40S38 commented 1 year ago

Hi, long time no see.

I solved this problem, so let me share the fix.

- The URL used to retrieve the cluster file (in deeplytough/misc/utils.py) should be changed to the one below (a sketch of the patched method follows the URL).

https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-{self.identity}.txt
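For concreteness, the patched method would look roughly like this (a sketch based on the call visible in the traceback above; the surrounding code of RcsbPdbClusters may differ):

from urllib import request

def _download_cluster_sets(self, cluster_file_path):
    # Old endpoint, now returning 404:
    #   https://cdn.rcsb.org/resources/sequence/clusters/bc-{self.identity}.out
    # New endpoint, clustered by entity rather than by chain:
    request.urlretrieve(
        f'https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-{self.identity}.txt',
        cluster_file_path
    )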

- Also, the format of entries in the cluster file changed from {pdb_id}_{chain_id} to {pdb_id}_{entity_id}, so I could no longer find the cluster id of most proteins. You therefore need to obtain the entity id of each chain somehow and look up the cluster it belongs to (see the lookup sketch after the code below), or split in some other way (e.g. uniprot_folds). In my case, when building pdbcode_mappings.pickle during preprocessing for the TOUGH-M1 dataset, I retrieve the entity_id in pdb_chain_to_uniprot in deeplytough/datasets/toughm1.py:

import logging
import requests

logger = logging.getLogger(__name__)


def pdb_chain_to_uniprot(pdb_code, query_chain_id):
    """
    Get pdb chain mapping to uniprot accession using the pdbe api
    (modified to return the entity id of the matching chain).
    """
    result = 'None'
    entity_id = 'None'
    r = requests.get(f'http://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{pdb_code}')
    fam = r.json()[pdb_code]['UniProt']

    # Each key is a uniprot accession; each mapping covers one chain segment.
    for fam_id in fam.keys():
        for chain in fam[fam_id]['mappings']:
            if chain['chain_id'] == query_chain_id:
                if result != 'None' and fam_id != result:
                    logger.warning(f'DUPLICATE {fam_id} {result}')
                result = fam_id
                entity_id = chain['entity_id']
    if result == 'None':
        logger.warning(f'No uniprot accession found for {pdb_code}: {query_chain_id}')
    return entity_id
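With the entity id in hand, cluster lookup keys become '{pdb_id}_{entity_id}'. A minimal sketch of the lookup mentioned above (load_entity_clusters is a hypothetical helper; it assumes the new file keeps the usual RCSB layout of one cluster per line with whitespace-separated members):

def load_entity_clusters(cluster_file_path):
    """Map '{pdb_id}_{entity_id}' members (e.g. '1ATP_1') to a numeric cluster id."""
    entity_to_cluster = {}
    with open(cluster_file_path) as f:
        for cluster_id, line in enumerate(f):
            for member in line.split():  # one cluster per line
                entity_to_cluster[member.upper()] = cluster_id
    return entity_to_cluster

# clusters = load_entity_clusters('datasets_structure/clusters-by-entity-30.txt')
# cluster_id = clusters.get(f'{pdb_code.upper()}_{entity_id}')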

I hope this helps you get it running.