Open NZ99 opened 10 months ago
Has this been looked into? I could take a look at it if someone could help me sanity check it.
I have not yet. Wanna collaborate on it, @cmvcordova? I can start pulling the latest version onto the cluster (there is a fairly old one already, but there is no point in using it if that hurts reproducibility), though I'm not 100% clear on what kind of preprocessing is needed.
Let's do it! We can probably ping the rest of the team in the Discord channel as we progress, to make sure we're on the right track.
Quick update:
We're currently having trouble downloading the dataset on the ingress node. The zipped files are approximately 3.3 TB, which exceeds any user's limit. After contacting the StabilityAI team, we're changing our approach to download directly to S3 using the spark cluster node instead.
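For reference, a rough sketch of what a direct S3-to-S3 copy could look like, so nothing has to land on the ingress node's disk. This is not what we actually ran. It assumes boto3 is available on the spark node, that the source bucket is called `openfold` (check the registry page linked in the issue description), and that credentials with write access to the destination are configured. `DST_PREFIX`, `plan_copies`, `run_copies` and the example key are made-up names for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Assumptions for illustration only, not the team's actual configuration.
SRC_BUCKET = "openfold"          # assumed name of the public source bucket
DST_BUCKET = "openbioml"         # destination bucket mentioned in this thread
DST_PREFIX = "openproteinset/"   # hypothetical destination prefix

CopyPlan = List[Tuple[Dict[str, str], str]]


def plan_copies(keys: Iterable[str], src_bucket: str, dst_prefix: str) -> CopyPlan:
    """Pair each source key with the key it should get in the destination."""
    prefix = dst_prefix.rstrip("/")
    plan: CopyPlan = []
    for key in keys:
        if key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        dst_key = f"{prefix}/{key}" if prefix else key
        plan.append(({"Bucket": src_bucket, "Key": key}, dst_key))
    return plan


def run_copies(plan: CopyPlan, dst_bucket: str) -> None:
    """Run the planned copies bucket-to-bucket, without staging files locally."""
    import boto3  # imported lazily so plan_copies works without boto3 installed

    s3 = boto3.client("s3")
    for copy_source, dst_key in plan:
        # Managed copy: boto3 switches to multipart copy for large objects.
        s3.copy(copy_source, dst_bucket, dst_key)


if __name__ == "__main__":
    # Hypothetical key, only to show the shape of the plan.
    example = plan_copies(["pdb/example/file.a3m"], SRC_BUCKET, DST_PREFIX)
    print(example)
    # run_copies(example, DST_BUCKET)  # uncomment once credentials are set up
```

Keeping the planning step separate from the copy step makes it possible to dry-run the key mapping before moving multiple terabytes.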
@NZ99 @cmvcordova, I believe this is now complete, based on the latest conversation with Niccolo. Could you please confirm?
Confirming OPS is on the cluster and accessible through s3://openbioml/
Download and prepare OpenProteinSet on the cluster, while deleting the old version on S3.
https://registry.opendata.aws/openfold/