Closed amorehead closed 1 year ago
Are you downloading the raw structures from the PDB? IIRC the download tool should rename .ent files to .pdb automatically.
It looks like there are only .gz and .mmtf files in my raw pdb download directory:
find proteinworkshop/data/pdb -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u
gz
mmtf
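For anyone who would rather run the same check from Python, here is a minimal sketch of an equivalent to the find/perl one-liner above. The function name unique_extensions is made up for illustration and is not part of ProteinWorkshop. One small difference: pathlib reports no suffix for dotfiles such as .hidden, whereas the regex above would report "hidden".

```python
from pathlib import Path


def unique_extensions(root: str) -> list[str]:
    """Return the sorted set of final file extensions found under root.

    Hypothetical helper for illustration; not part of ProteinWorkshop.
    """
    extensions = set()
    for path in Path(root).rglob("*"):
        # Only regular files with an extension; suffix is the last dotted part,
        # so "1abc.mmtf.gz" contributes "gz", matching the one-liner above.
        if path.is_file() and path.suffix:
            extensions.add(path.suffix.lstrip("."))
    return sorted(extensions)


if __name__ == "__main__":
    print(unique_extensions("proteinworkshop/data/pdb"))
```

The path passed in the last line is the same download directory used in the shell command; adjust it to wherever your data_dir points.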
Yep, that's correct actually. No idea how this change happened: https://github.com/a-r-j/ProteinWorkshop/blame/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/fold_classification.py#L167C26-L167C26
The arg is hardcoded and should default to .mmtf or .mmtf.gz.
Should we push this fix to main, or should I simply change it in my branch and then handle it in an upcoming PR?
Looks like this is also hard-coded for ASTRAL: https://github.com/a-r-j/ProteinWorkshop/blob/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/astral.py#L190
Testing the changes locally and will make a small PR.
For ASTRAL it needs to be hardcoded; the structures are only provided in PDB/ent format at this point in time AFAIK.
Actually, on closer examination, I think the .ent extension is correct for FoldClassification. It also uses structures from ASTRAL. Let me investigate.
Related to https://github.com/a-r-j/ProteinWorkshop/pull/53, how do I download the ASTRAL dataset? When I try using the workshop CLI to download it, I am shown the error:
workshop download: error: argument dataset: invalid choice: 'astral' (choose from 'pdb', 'afdb_rep_v4', 'afdb_rep_dark_v4', 'afdb_swissprot', 'afdb_swissprot_v4', 'afdb_uniprot_v4', 'esmatlas', 'highquality_clust30', 'a_thaliana', 'c_albicans', 'c_elegans', 'd_discoideum', 'd_melanogaster', 'd_rerio', 'e_coli', 'g_max', 'h_sapiens', 'm_jannaschii', 'm_musculus', 'o_sativa', 'r_norvegicus', 's_cerevisiae', 's_pombe', 'z_mays', 'antibody_developability', 'cath', 'ccpdb', 'ccpdb_ligands', 'ccpdb_metal', 'ccpdb_nucleic', 'ccpdb_nucleotides', 'deep_sea_proteins', 'ec_reaction', 'fold_classification', 'fold_fold', 'fold_family', 'fold_superfamily', 'go-bp', 'go-cc', 'go-mf', 'masif_site', 'metal_3d', 'ptm')
It's downloaded automatically in the datamodule if no copy is found in your data_dir.
I've been having difficulties downloading it, and now I think I know why. I believe we need to call download_structures() in setup() in addition to download_data_files(): https://github.com/a-r-j/ProteinWorkshop/blob/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/fold_classification.py#L156
For some reason, download() itself doesn't get called for this data module, at least not when I would expect it to.
Ah, good spot! Yes, you're right. I think the FoldClassification datamodule overrides the base class setup(), which would otherwise call download(). I think we just need to add a download() call to FoldClassificationDataModule.setup().
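To make the failure mode concrete, here is a toy sketch of the pattern being described. The classes below are hypothetical stand-ins, not the actual ProteinWorkshop code, and they assume (as suggested in this thread) that the base setup() is what triggers download().

```python
class BaseDataModule:
    """Hypothetical stand-in for the shared base datamodule."""

    def __init__(self):
        self.calls = []  # records which steps ran, for illustration only

    def download(self):
        self.calls.append("download")

    def setup(self, stage=None):
        # Assumed behaviour: the base setup() fetches data before anything else.
        self.download()
        self.calls.append("base_setup")


class BrokenFoldModule(BaseDataModule):
    def setup(self, stage=None):
        # Overrides setup() without calling download(), so nothing is fetched.
        self.calls.append("fold_setup")


class FixedFoldModule(BaseDataModule):
    def setup(self, stage=None):
        # Explicitly trigger the download before the dataset-specific setup.
        self.download()
        self.calls.append("fold_setup")


if __name__ == "__main__":
    for cls in (BrokenFoldModule, FixedFoldModule):
        module = cls()
        module.setup()
        print(cls.__name__, module.calls)
```

Whether the real fix should call download() directly or go through super().setup() depends on what else the base setup() does, which this sketch does not model.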
With this change implemented, my original issue for downloading the fold_fold dataset should be resolved.
It's worth noting that, for now, my workaround involves calling download_structures() manually here (this was merged in with my UMAP embedding code): https://github.com/a-r-j/ProteinWorkshop/blob/ceabdaec3bf7e61292f033507bd092bff0d7c61a/proteinworkshop/datasets/fold_classification.py#L157
When attempting to select dataset=fold_fold, I received the following file extension error for .ent files:
For context, I am selecting task=multiclass_graph_classification as well.