a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License
1.02k stars, 131 forks

Can't read PDB files ending with .pdb1.gz #343

Closed: pengzhangzhi closed this issue 1 year ago

pengzhangzhi commented 1 year ago

Hi,

I found an interesting naming convention in the PDB that causes a bug in graphein. When I download files from the PDB, e.g. 6mhu.pdb1.gz, the file name ends with pdb1.gz, and graphein cannot read it because it expects pdb.gz: https://github.com/a-r-j/graphein/blob/77a4d9ab90dd525876766e6d5b88f0bb7ac10274/graphein/protein/graphs.py#L103 However, the two formats are the same thing. I'm just curious why the PDB uses this naming format. I'd also suggest supporting it, and I'm happy to submit a PR for that. Below is the code I use to download the PDB files.


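```python
import os
import subprocess


# Method excerpted from a larger class, hence the `self` parameter.
def _download_pdb(self, savedir=".data/all_biounits"):
    """Mirror the PDB biological-assembly (biounit) files into `savedir` via rsync."""
    os.makedirs(savedir, exist_ok=True)

    LOGFILE = "pdb_logs"  # defined in the original snippet but not used here
    SERVER = "rsync.ebi.ac.uk::pub/databases/rcsb/pdb-remediated"
    PORT = "873"
    FTPPATH = "/data/biounit/PDB/divided/"

    # Construct the rsync command
    rsync_cmd = [
        "rsync",
        "-rlpt",  # recurse, keep symlinks, permissions and modification times
        "-v",
        "-z",  # compress data in transit
        "--delete",  # remove local files that no longer exist on the server
        f"--port={PORT}",
        f"{SERVER}{FTPPATH}",
        savedir,
    ]
    print(" ".join(rsync_cmd))
    # check=True raises CalledProcessError if rsync exits with a non-zero status
    subprocess.run(
        rsync_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )

    return savedir
```
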
pengzhangzhi commented 1 year ago

I found that PandasPdb (from BioPandas) also does not support the pdb1.gz format...

a-r-j commented 1 year ago

Hi @pengzhangzhi, thanks for raising this issue. It's a good spot and it should be supported. I'd suggest opening another issue with biopandas, as we will need to add support there first.

Re: why this format exists: the numeric suffix distinguishes biological assemblies (e.g. `.pdb1` is biological assembly 1).
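
For illustration, here is a minimal sketch of how a file-name check could be relaxed to accept the biological-assembly suffix. The helper names `is_pdb_path`, `assembly_number`, and `read_pdb_text` are hypothetical and are not part of graphein or biopandas; the sketch only assumes that a `.pdbN.gz` file contains ordinary gzipped PDB text.

```python
import gzip
import re
from pathlib import Path

# Matches .pdb, .pdb.gz, .pdb1, .pdb1.gz, .pdb12.gz, ... (case-insensitive).
_PDB_SUFFIX = re.compile(r"\.pdb(\d+)?(\.gz)?$", re.IGNORECASE)


def is_pdb_path(path):
    """Return True if the file name looks like a (possibly gzipped) PDB or biounit file."""
    return _PDB_SUFFIX.search(Path(path).name) is not None


def assembly_number(path):
    """Return the biological assembly number from the suffix, or None if there is none."""
    match = _PDB_SUFFIX.search(Path(path).name)
    if match is None or match.group(1) is None:
        return None
    return int(match.group(1))


def read_pdb_text(path):
    """Read a PDB file as text, decompressing it when the name ends in .gz."""
    if not is_pdb_path(path):
        raise ValueError(f"Not a recognised PDB file name: {path}")
    opener = gzip.open if str(path).lower().endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.read()
```

With this pattern, `6mhu.pdb1.gz` and `6mhu.pdb.gz` are both accepted, and the assembly number stays available to callers that need to tell assemblies apart.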

pengzhangzhi commented 1 year ago

aha, thanks :)