fudan-generative-vision / dynamicPDB

Dynamic PDB datasets

A successful test of data downloading. #6

Open Kaihui-Cheng opened 1 week ago

Kaihui-Cheng commented 1 week ago
  1. Make sure you have Git LFS installed:

    
    sudo apt-get install git-lfs 
    # Initialize Git LFS
    git lfs install
  2. Navigate to your DATA_ROOT and clone the source:

    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/datasets/fudan-generative-vision/dynamicPDB.git dynamicPDB_raw
  3. Download data with a specific protein_id, for example 1a62_A:

    cd dynamicPDB_raw
    git lfs pull --include="{protein_id}/*"
  4. Merge the split archive parts into a single file, then extract the .tar.gz file:

    cat {protein_id}/{protein_id}.tar.gz.part* > {protein_id}/{protein_id}.tar.gz
    cd {DATA_ROOT}  # the directory that contains dynamicPDB_raw
    mkdir -p dynamicPDB  # no error if the directory already exists
    tar -xvzf dynamicPDB_raw/{protein_id}/{protein_id}.tar.gz -C dynamicPDB

OK! Now we have the simulation data for protein_id.

Note: Sufficient storage space is required for the data. For 1a62_A, 24GB is needed for the zipped files and 33GB for the unzipped files.
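For anyone who wants to script step 4, here is a minimal sketch of the merge-and-extract part as a bash function. The function name `merge_and_extract` and its three arguments are my own, not part of the dataset tooling. It assumes the parts are named `{protein_id}.tar.gz.part*` as in the commands above, and that they have already been pulled with `git lfs pull`.

```shell
#!/usr/bin/env bash
# Sketch only: merge the split archive for one protein and extract it.
# Hypothetical arguments: <raw clone dir> <output dir> <protein_id>
merge_and_extract() {
    local raw_dir="$1" out_dir="$2" protein_id="$3"
    local archive="${raw_dir}/${protein_id}/${protein_id}.tar.gz"

    # Stop early if no parts were pulled for this protein.
    if ! ls "${archive}".part* >/dev/null 2>&1; then
        echo "no archive parts found for ${protein_id}" >&2
        return 1
    fi

    # The glob expands in lexicographic order, the same order the
    # original `cat ... .part*` command relies on.
    cat "${archive}".part* > "${archive}" || return 1

    # -p: do nothing if the output directory already exists.
    mkdir -p "${out_dir}"
    tar -xzf "${archive}" -C "${out_dir}"
}
```

Example call from DATA_ROOT: `merge_and_extract dynamicPDB_raw dynamicPDB 1a62_A`. The merged .tar.gz is left in place next to the parts, so the storage note above still applies.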

meatball1982 commented 1 week ago

Dear Kaihui-Cheng:

  1. There are currently 10 PDB IDs available (1a62_A, ..., 1bq8_A). Could you kindly provide a list of all the PDB IDs (the 12.6k filtered proteins) in your dataset (IDs only)? Then we (most readers of your paper) could choose specific PDBs to download.
  2. The README says "we have decided to provide the 100ns simulation data for all proteins for online download". Still, I see no instructions for downloading the 100ns data for all proteins. Could you help me with that?

Thank you so much, and I look forward to your reply.

Best, M

zqcai19 commented 3 days ago

@meatball1982 Hi! Thank you for your valuable suggestions.

  1. We are still working on uploading the complete dataset, as its size is significantly large. However, we can provide a list on ModelScope to record the currently available protein data. This list may make it easier for users to choose the specific PDBs they want to download.
  2. The instructions described above by @Kaihui-Cheng are for downloading the 100ns simulation data, which we are actively uploading. If you would like to download all currently available protein data at once, you can run git lfs pull on its own (without specifying --include="{protein_id}/*") in step 3.

Please let us know if you have any other questions or suggestions.
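A middle ground between pulling one protein and pulling everything is to pull a chosen subset in one command. The sketch below relies on `git lfs pull --include` accepting a comma-separated list of path patterns. The helper names `build_include` and `pull_proteins` are hypothetical, and the two IDs in the example are simply the ones mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: build a comma-separated --include pattern from protein IDs.
build_include() {
    local pattern="" id
    for id in "$@"; do
        # Append "<id>/*", adding a comma separator after the first entry.
        pattern="${pattern:+${pattern},}${id}/*"
    done
    printf '%s\n' "${pattern}"
}

# Run inside the dynamicPDB_raw clone; downloads only the listed proteins.
pull_proteins() {
    git lfs pull --include="$(build_include "$@")"
}
```

Example call from inside dynamicPDB_raw: `pull_proteins 1a62_A 1bq8_A`. Each protein then still needs the merge and extract from step 4.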