drorlab / DIPS

Database of Interacting Protein Structures (DIPS)
https://arxiv.org/abs/1807.01297
MIT License

Error in make_dataset and question about filtering #6

Open octavian-ganea opened 3 years ago

octavian-ganea commented 3 years ago

Hi,

Thanks for these great resources. I have 2 questions:

  1. Could you please detail the exact filtering criteria used in prune_pairs.py, and say whether they were already applied to the 42,826 pairs listed in the paper?
  2. I tried to run make_dataset on a subset of DIPS but got the error below. Could you please help? Thanks.
    
    $ python src/make_dataset.py ../raw/pdb/ ../interim
    2021-09-06 13:35:29,892 INFO 10990: making final data set from interim data
    2021-09-06 13:35:33,994 INFO 10990: 2566 requested keys, 0 produced keys, 2566 work keys
    2021-09-06 13:35:34,058 INFO 10990: Processing 2566 inputs.
    2021-09-06 13:35:34,058 INFO 10990: Sequential Mode.
    2021-09-06 13:35:34,058 INFO 10990: Reading ../raw/pdb/17/317d.pdb1.gz
    Traceback (most recent call last):
      File "src/make_dataset.py", line 45, in <module>
        main()
      File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1134, in __call__
        return self.main(*args, **kwargs)
      File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1059, in main
        rv = self.invoke(ctx)
      File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1401, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 767, in invoke
        return __callback(*args, **kwargs)
      File "src/make_dataset.py", line 30, in main
        pa.parse_all(input_dir, parsed_dir, num_cpus)
      File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/parse.py", line 57, in parse_all
        par.submit_jobs(parse, inputs, num_cpus)
      File "miniconda/miniconda3/lib/python3.8/site-packages/parallel.py", line 62, in submit_jobs
        out = [function(*args) for args in inputs]
      File "miniconda/miniconda3/lib/python3.8/site-packages/parallel.py", line 62, in <listcomp>
        out = [function(*args) for args in inputs]
      File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/parse.py", line 64, in parse
        df = struct.parse_structure(pdb_filename, one_model=False)
      File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/structure.py", line 61, in parse_structure
        biopy_structure = db.parse_biopython_structure(structure_filename)
      File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/database.py", line 59, in parse_biopython_structure
        biopy_structure = parser.get_structure('pdb', gzip.open(pdb_filename))
      File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 100, in get_structure
        self._parse(lines)
      File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 121, in _parse
        self.header, coords_trailer = self._get_header(header_coords_trailer)
      File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 139, in _get_header
        header_dict = _parse_pdb_header_list(header)
      File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/parse_pdb_header.py", line 199, in _parse_pdb_header_list
        pdbh_dict["structure_reference"] = _get_references(header)
      File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/parse_pdb_header.py", line 38, in _get_references
        if re.search(r"\AREMARK   1", l):
      File "miniconda/miniconda3/lib/python3.8/re.py", line 201, in search
        return _compile(pattern, flags).search(string)
    TypeError: cannot use a string pattern on a bytes-like object

vsomnath commented 2 years ago

I was able to resolve this error by running gzip -dr DIPS/raw/pdb, which recursively decompresses every file under that directory, before running make_dataset.py.
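
For readers wondering why decompressing helps: the traceback shows the file being opened with gzip.open(pdb_filename), and gzip.open defaults to binary mode, so each line arrives as bytes while the header parser applies a str regular expression to it. The sketch below reproduces that mismatch in isolation. The file name and its one-line content are hypothetical stand-ins, and this is only an illustration of the failure mode, not a patch to atom3 or Biopython.

```python
import gzip
import os
import re
import tempfile

# Write a tiny gzipped stand-in for a PDB file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.pdb1.gz")
with gzip.open(path, "wt") as handle:
    handle.write("REMARK   1 REFERENCE 1\n")


def first_line_matches(file_obj):
    """Apply the str pattern seen in the traceback to the first line."""
    line = file_obj.readline()
    return re.search(r"\AREMARK   1", line) is not None


# Binary mode (the gzip.open default) yields bytes, so the str pattern fails.
with gzip.open(path) as handle:
    try:
        first_line_matches(handle)
        binary_mode_error = None
    except TypeError as exc:
        binary_mode_error = str(exc)

# Text mode ("rt") yields str, so the same pattern works.
with gzip.open(path, "rt") as handle:
    text_mode_ok = first_line_matches(handle)

print(binary_mode_error)
print(text_mode_ok)
```

Decompressing the files up front sidesteps the problem without touching any installed package, which is presumably why it is the workaround reported in this thread.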

octavian-ganea commented 2 years ago

Yeah, me too, using https://github.com/amorehead/DIPS-Plus/blob/main/project/datasets/builder/extract_raw_pdb_gz_archives.py
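
If the gzip command line tool is not available, the same workaround can be sketched in Python. This is a minimal illustration of recursive in-place decompression, roughly mirroring gzip -dr; it is not the contents of the linked script, and the function name decompress_tree is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def decompress_tree(root):
    """Decompress every *.gz file under root in place; return the new paths."""
    written = []
    # Materialize the listing first so deletions do not disturb the walk.
    for gz_path in sorted(Path(root).rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # e.g. 317d.pdb1.gz -> 317d.pdb1
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # like gzip -d, drop the compressed original
        written.append(out_path)
    return written
```

Calling decompress_tree("DIPS/raw/pdb") before make_dataset.py would then play the same role as the gzip command above. Note that it deletes the .gz originals, so keep a copy if the compressed files are still needed.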