BioinfoMachineLearning / DIPS-Plus

The Enhanced Database of Interacting Protein Structures for Interface Prediction
https://zenodo.org/record/5134732
GNU General Public License v3.0

about fasta sequences #9

Closed · XuBlack closed this issue 2 years ago

XuBlack commented 2 years ago

When I run the command

```bash
python3 project/datasets/builder/postprocess_pruned_pairs.py "$PROJDIR"/project/datasets/DB5/raw "$PROJDIR"/project/datasets/DB5/interim/pairs "$PROJDIR"/project/datasets/DB5/interim/external_feats "$PROJDIR"/project/datasets/DB5/final/raw --num_cpus 32 --source_type db5
```

it fails with:

```
FileNotFoundError: [Errno 2] No such file or directory: '/opt/data/private/protein/DIPS-Plus/project/datasets/DB5/interim/external_feats/OF/work'
```

It seems that the FASTA files are missing. I couldn't find the code for processing FASTA sequences in what you have shared. Is some code missing on GitHub, or do I need to download the files myself?

Thanks.

XuBlack commented 2 years ago

The last error occurred at:

```
File "/DIPS-Plus/project/utils/utils.py", line 390, in find_fasta_sequences_for_pdb_file
    fasta_files = [os.path.join(external_feats_subdir, file) for file in os.listdir(external_feats_subdir)
```
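
For context, `os.listdir` raises exactly this error whenever the directory it is given does not exist, which suggests the `OF/work` directory was never created:

```python
import os

# Minimal reproduction: listing a directory that does not exist raises
# the same FileNotFoundError seen above.
os.listdir("/nonexistent/dir")  # FileNotFoundError: [Errno 2] No such file or directory
```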

amorehead commented 2 years ago

Hi, @XuBlack.

In our feature generation pipeline, when you run the script generate_hhsuite_features.py, it should write these FASTA sequence files to the directory you listed above (e.g., /opt/data/private/protein/DIPS-Plus/project/datasets/DB5/interim/external_feats/OF/work), assuming you provided generate_hhsuite_features.py with the value "$PROJDIR"/project/datasets/DB5/interim/external_feats for the CLI argument output_dir. This logic is housed in the atom3-py3 library DIPS-Plus makes use of. Specifically, you can find where the FASTA sequence files should be written to local storage here.

Since HH-suite3 makes use of FASTA sequence files as input, we have to extract the FASTA sequence for each input PDB file and write it to local storage before running HH-suite3 in generate_hhsuite_features.py. When you then run postprocess_pruned_pairs.py in the way that you did, this script should then assemble the full filepaths to each of the (previously-generated) FASTA sequence files corresponding to each PDB file you are postprocessing.
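
As a minimal sketch of the extraction step (using Biopython purely for illustration; the actual logic in atom3-py3 differs in its details), extracting per-chain FASTA sequences from a PDB file might look like this:

```python
# Illustrative sketch only (assumes Biopython is installed); atom3-py3's
# real implementation is what the pipeline actually uses.
from Bio import SeqIO

def write_fasta_for_pdb(pdb_path: str, fasta_path: str) -> None:
    # Parse one sequence record per chain from the PDB file's ATOM records
    # and write them out in FASTA format for HH-suite3 to consume.
    with open(fasta_path, "w") as out:
        for record in SeqIO.parse(pdb_path, "pdb-atom"):
            out.write(f">{record.id}\n{record.seq}\n")

# e.g., write_fasta_for_pdb("1a2b.pdb", "external_feats/OF/work/1a2b.fasta")
```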

Can you confirm that for the DB5 dataset you have run generate_hhsuite_features.py before running postprocess_pruned_pairs.py? Also, can you verify whether generate_hhsuite_features.py indeed wrote the FASTA sequence files to "$PROJDIR"/project/datasets/DB5/interim/external_feats as expected?
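
For instance, a quick standalone check (not part of the pipeline, and assuming the files use a `.fasta` extension) would be:

```python
# Hypothetical sanity check: count the FASTA files written under the
# external_feats directory; adjust the pattern if a different extension is used.
import glob
import os

external_feats = os.path.expandvars("$PROJDIR/project/datasets/DB5/interim/external_feats")
fasta_files = glob.glob(os.path.join(external_feats, "**", "*.fasta"), recursive=True)
print(f"Found {len(fasta_files)} FASTA files under {external_feats}")
```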

XuBlack commented 2 years ago

Thank you for your reply! I did run generate_hhsuite_features.py before running postprocess_pruned_pairs.py. Following your pointer, after reading the source code of atom3-py3, I found that when I run the command

```bash
python3 "$PROJDIR"/project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DB5/interim/parsed "$PROJDIR"/project/datasets/DB5/interim/parsed "$HHSUITE_DB" "$PROJDIR"/project/datasets/DB5/interim/external_feats --rank "$1" --size "$2" --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type db5 --write_file
```

it only generates a CSV file. To actually generate the HH-suite features, I need to change the `--write_file` parameter to `--read_file` and run the command again.

When I ran it again with the new `--read_file` parameter, another error occurred. For the DB5 dataset, when make_dataset.py runs, the generated .pkl file is written to `output_dir + '/' + db.get_pdb_code(pdb_filename)[1:3] + db.get_pdb_name(pdb_filename) + ".pkl"`; this construction appears in lines 48-57 of parse.py in atom3-py3, as well as in conservation.py. However, when the path is parsed back in lines 451-454 of conservation.py, the path obtained is `output_dir + '/' + db.get_pdb_code(pdb_filename) + db.get_pdb_name(pdb_filename) + ".pkl"`, without the `[1:3]` slice. The two paths are different.
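
To make the mismatch concrete, here is a toy reproduction (the PDB code and name values are made up, and the helper outputs are simplified assumptions):

```python
# Toy stand-ins for db.get_pdb_code(...) and db.get_pdb_name(...).
output_dir = "external_feats"
pdb_code = "1a2b"       # hypothetical db.get_pdb_code(pdb_filename)
pdb_name = "1a2b.pdb1"  # hypothetical db.get_pdb_name(pdb_filename)

# Path used when the .pkl file is written (note the [1:3] slice) ...
written = output_dir + '/' + pdb_code[1:3] + pdb_name + ".pkl"
# ... versus the path reconstructed later, without the slice.
parsed = output_dir + '/' + pdb_code + pdb_name + ".pkl"

print(written)  # external_feats/a21a2b.pdb1.pkl
print(parsed)   # external_feats/1a2b1a2b.pdb1.pkl
assert written != parsed  # the reader looks for a file that was never written
```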

After I rewrote the path, it ran successfully!

Once all of the HH-suite features have been generated successfully, I will try running postprocess_pruned_pairs.py.

Thanks!

amorehead commented 2 years ago

@XuBlack,

Once your HH-suite features have finished generating and you can verify that your complexes are postprocessed successfully by postprocess_pruned_pairs.py, would you be able to share which lines of code, in either this repository or in the atom3-py3 repository, you needed to change to generate the DB5 complexes? You can either reply with your changes here or, if you'd rather, open a pull request to merge your changes into master.

I greatly appreciate your attention to detail as you use this pipeline! It seems I may have missed making some changes to the filepaths used in this project since updating the DeepInteract repository. I will try to get those corrected once we know exactly which filepaths are currently incorrect for the DB5 dataset.

XuBlack commented 2 years ago

I have created a pull request in the atom3-py3 repository.

And thank you for sharing your other repository, DeepInteract. I'm very interested in it and will learn more about it.

amorehead commented 2 years ago

@XuBlack, I just finished upgrading DIPS-Plus' (this repository's) version of atom3-py3 to include the bug fix you authored over in the atom3-py3 repository. If you encounter any further issues in the construction of filepaths, please let us know. We appreciate your pointing this bug out.