a-r-j / ProteinWorkshop

Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
https://proteins.sh/
MIT License

Corrupted File in GeneOntology #95

Open AJB117 opened 1 month ago

AJB117 commented 1 month ago

Hi, I downloaded the GeneOntology dataset from the provided Zenodo link, but I came across this error during model evaluation:

PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

After some digging, it looks like the file 1jhw_A.pt is causing this. I verified this with a simple torch.load in the unzipped GeneOntology directory. I'm currently working around this by adding "1JHW-A" to https://github.com/a-r-j/ProteinWorkshop/blob/main/proteinworkshop/datasets/go.py?plain=1#L288. Is this protein meant to be dropped? Thanks!
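For anyone who wants to check whether other files in the download are affected, here is a minimal sketch of a directory scan. It is not part of ProteinWorkshop: the names `zip_is_readable` and `find_unreadable` are made up for illustration. The default check uses only the standard library and assumes the `.pt` files are in the zip-based format that `torch.save` has written by default since PyTorch 1.6, so a file that passes is a valid archive but is not guaranteed to load.

```python
import zipfile
from pathlib import Path
from typing import Callable, List


def zip_is_readable(path: Path) -> None:
    """Raise if `path` is not a readable zip archive (stdlib-only check)."""
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the first member with a bad CRC, or None if all are fine.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"bad CRC in member {bad_member}")


def find_unreadable(
    directory: str,
    pattern: str = "*.pt",
    check: Callable[[Path], object] = zip_is_readable,
) -> List[str]:
    """Return the names of files matching `pattern` for which `check` raises.

    `check` is any callable that raises on an unreadable file; it is a
    hypothetical hook for this sketch, not a ProteinWorkshop API.
    """
    unreadable = []
    for path in sorted(Path(directory).glob(pattern)):
        try:
            check(path)
        except Exception:
            # Any failure while opening the file is treated as "unreadable".
            unreadable.append(path.name)
    return unreadable
```

With PyTorch installed, passing `check=torch.load` reproduces the check described above on every file. Depending on the PyTorch version, loading pickled graph objects may need something like `check=lambda p: torch.load(p, weights_only=False)`, otherwise healthy files could be reported as unreadable.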

a-r-j commented 1 month ago

Hi @AJB117, thanks for flagging this. We'll try to update the Zenodo record. I don't believe it should be dropped, no. Excluding it is probably fine, or you can re-build the dataset from source (i.e. delete everything in processed/).
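The "delete everything in processed/" step could be sketched as below. This is an illustration only: `clear_processed` is a hypothetical helper, and it assumes the dataset root (for example a local `data/GeneOntology` folder, a made-up path) contains a `processed/` subdirectory. Whether the next run actually re-builds the files depends on the dataset code, not on this snippet.

```python
import shutil
from pathlib import Path
from typing import List


def clear_processed(dataset_root: str) -> List[str]:
    """Delete every entry inside `<dataset_root>/processed`, keeping the folder.

    Returns the names of the removed entries. The directory layout is an
    assumption of this sketch.
    """
    processed = Path(dataset_root) / "processed"
    removed: List[str] = []
    if not processed.is_dir():
        # Nothing to clear.
        return removed
    for entry in sorted(processed.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a real subdirectory and its contents
        else:
            entry.unlink()  # remove a file or a symlink
        removed.append(entry.name)
    return removed
```

Only `processed/` is touched, so anything stored next to it (such as the raw download) stays in place.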