a-r-j / ProteinWorkshop

Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
https://proteins.sh/
MIT License

`fold_fold` dataset cannot be downloaded #54

Closed amorehead closed 1 year ago

amorehead commented 1 year ago

When attempting to select dataset=fold_fold, I received the following file extension error for .ent files:

Invalid format: ent. Must be 'pdb' or 'mmtf'.

For context, I am selecting task=multiclass_graph_classification as well.

a-r-j commented 1 year ago

Are you downloading the raw structures from the PDB? IIRC the download tool should rename .ent files to .pdb automatically.

amorehead commented 1 year ago

It looks like there are only .gz and .mmtf files in my raw pdb download directory:

find proteinworkshop/data/pdb -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u
gz
mmtf
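
For readers without perl, the same check (listing the unique final file extensions under a directory) can be sketched in Python. The function name `unique_extensions` is made up for this illustration and is not part of ProteinWorkshop.

```python
from pathlib import Path


def unique_extensions(root):
    """Return the sorted, de-duplicated final extensions (without the dot)
    of all regular files found recursively under `root`."""
    return sorted(
        {
            path.suffix.lstrip(".")
            for path in Path(root).rglob("*")
            if path.is_file() and path.suffix
        }
    )
```

Calling `unique_extensions("proteinworkshop/data/pdb")` on a directory like the one above should report the same extensions as the shell pipeline, since only the last suffix of each file name is counted.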

a-r-j commented 1 year ago

Yep, that's correct actually. No idea how this change happened: https://github.com/a-r-j/ProteinWorkshop/blame/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/fold_classification.py#L167C26-L167C26

The arg is hardcoded and should default to .mmtf or .mmtf.gz.

amorehead commented 1 year ago

Should we push this fix to main, or should I simply change it in my branch and then handle it in an upcoming PR?

amorehead commented 1 year ago

Looks like this is also hard-coded for ASTRAL: https://github.com/a-r-j/ProteinWorkshop/blob/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/astral.py#L190

a-r-j commented 1 year ago

Testing the changes locally and will make a small PR.

For ASTRAL it needs to be hardcoded; the structures are only provided in PDB/ent format at this point in time AFAIK.

a-r-j commented 1 year ago

Actually, on closer examination I think the .ent extension is correct for FoldClassification. It also uses structures from ASTRAL. Let me investigate.

amorehead commented 1 year ago

Related to https://github.com/a-r-j/ProteinWorkshop/pull/53, how do I download the ASTRAL dataset? When I try using the workshop CLI to download it, I am shown the error:

workshop download: error: argument dataset: invalid choice: 'astral' (choose from 'pdb', 'afdb_rep_v4', 'afdb_rep_dark_v4', 'afdb_swissprot', 'afdb_swissprot_v4', 'afdb_uniprot_v4', 'esmatlas', 'highquality_clust30', 'a_thaliana', 'c_albicans', 'c_elegans', 'd_discoideum', 'd_melanogaster', 'd_rerio', 'e_coli', 'g_max', 'h_sapiens', 'm_jannaschii', 'm_musculus', 'o_sativa', 'r_norvegicus', 's_cerevisiae', 's_pombe', 'z_mays', 'antibody_developability', 'cath', 'ccpdb', 'ccpdb_ligands', 'ccpdb_metal', 'ccpdb_nucleic', 'ccpdb_nucleotides', 'deep_sea_proteins', 'ec_reaction', 'fold_classification', 'fold_fold', 'fold_family', 'fold_superfamily', 'go-bp', 'go-cc', 'go-mf', 'masif_site', 'metal_3d', 'ptm')

a-r-j commented 1 year ago

It's downloaded automatically in the datamodule if no copy is found in your data_dir

https://github.com/a-r-j/ProteinWorkshop/blob/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/fold_classification.py#L123
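
As a rough illustration of the "download only if no copy is found" behaviour described here, a minimal sketch follows. It is not the code at the link above: `ensure_downloaded` and the `fetch` callback are hypothetical names, and the real datamodule may check for its files differently.

```python
from pathlib import Path


def ensure_downloaded(data_dir, filename, fetch):
    """Return the path to `filename` inside `data_dir`.

    `fetch` is a hypothetical callback that writes the file to the given
    path. It is only invoked when the file does not already exist.
    """
    target = Path(data_dir) / filename
    if not target.exists():
        # Create the data directory on first use, then fetch the file.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(target)
    return target
```

With this pattern a second call for the same file returns immediately, which is why no separate CLI download step is needed as long as the check actually runs.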

amorehead commented 1 year ago

I've been having difficulties downloading it, and now I think I know why. I believe we need to call download_structures() in setup() in addition to download_data_files(): https://github.com/a-r-j/ProteinWorkshop/blob/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/fold_classification.py#L156

amorehead commented 1 year ago

For some reason, download() itself doesn't get called for this data module, at least not when I would expect it to.

a-r-j commented 1 year ago

Ah, good spot! Yes, you're right. I think we're overriding the base class setup(), which would otherwise call download(), in the FoldClassification datamodule. I think we just need to add a download() call to FoldClassificationDataModule.setup().

https://github.com/a-r-j/ProteinWorkshop/blob/0e1cc2e370a977704ec93b2f8b2cd7d118a768e0/proteinworkshop/datasets/base.py#L75
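
The override problem can be reproduced with plain Python classes. This is a simplified sketch, not the actual ProteinWorkshop code: the class names and the `calls` list are stand-ins, and the only assumption taken from the thread is that the base setup() calls download().

```python
class BaseDataModule:
    """Stand-in for a base datamodule whose setup() triggers download()."""

    def __init__(self):
        self.calls = []  # records which methods ran, for illustration only

    def download(self):
        self.calls.append("download")

    def setup(self, stage=None):
        self.download()
        self.calls.append("base_setup")


class BrokenFoldModule(BaseDataModule):
    def setup(self, stage=None):
        # Overrides the base setup() and never calls download(),
        # so the data is silently never fetched.
        self.calls.append("fold_setup")


class FixedFoldModule(BaseDataModule):
    def setup(self, stage=None):
        # Proposed fix: call download() explicitly before the custom setup.
        self.download()
        self.calls.append("fold_setup")
```

Calling setup() on `BrokenFoldModule` records only the subclass step, while `FixedFoldModule` records the download first, which matches the behaviour reported above.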

amorehead commented 1 year ago

With this change implemented, my original issue for downloading the fold_fold dataset should be resolved.

amorehead commented 1 year ago

It's worth noting that, for now, my workaround is to call download_structures() manually here: https://github.com/a-r-j/ProteinWorkshop/blob/ceabdaec3bf7e61292f033507bd092bff0d7c61a/proteinworkshop/datasets/fold_classification.py#L157

This workaround was merged in with my UMAP embedding code.