OpenBioML / protein-lm-scaling


Prepare OpenProteinSet #28

Open NZ99 opened 10 months ago

NZ99 commented 10 months ago

Download and prepare OpenProteinSet on the cluster, while deleting the old version on S3.

Multiple sequence alignments (MSAs) for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. Template hits are also provided for the PDB chains and 270,000 UniClust30 clusters chosen for maximal diversity and MSA depth.

https://registry.opendata.aws/openfold/

cmvcordova commented 9 months ago

Has this been looked into? I could take a look at it if someone could help me sanity check it.

NZ99 commented 9 months ago

I have not yet. Wanna collaborate on it, @cmvcordova? I can start pulling the latest version onto the cluster (there is a fairly old one already, but there is no point in using it if that hurts reproducibility), though I'm not 100% clear on what kind of preprocessing is needed.

cmvcordova commented 9 months ago

Let's do it! We can probably ping the rest of the team in the Discord channel as we progress, to make sure we're on the right track.

cmvcordova commented 9 months ago

Quick update:

We're currently having trouble downloading the dataset on the ingress node. The zipped files are approximately 3.3 TB, which exceeds any user's limit. After contacting the StabilityAI team, we'll change our approach and download directly to S3 using the Spark cluster node instead.
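
For reference, a bucket-to-bucket copy could look roughly like the sketch below. It is not the command that was actually run: the source bucket name `s3://openfold/` is assumed from the registry page, the destination prefix `s3://openbioml/openproteinset/` is hypothetical, and the helper names are made up for illustration. It only wraps the AWS CLI's `aws s3 sync`, with `--dryrun` on by default so nothing is copied until the listing looks right.

```python
import shlex
import subprocess

# Assumed source: the public bucket named on the registry page.
SOURCE_URI = "s3://openfold/"
# Hypothetical destination prefix; adjust to wherever the data should live.
DEST_URI = "s3://openbioml/openproteinset/"


def build_sync_command(source, dest, dry_run=True):
    """Build the `aws s3 sync` argument list without running anything."""
    cmd = ["aws", "s3", "sync", source, dest]
    if dry_run:
        # --dryrun makes the CLI print the planned operations without copying.
        cmd.append("--dryrun")
    return cmd


def run_sync(source=SOURCE_URI, dest=DEST_URI, dry_run=True):
    """Print and run the sync; needs the AWS CLI and write access to `dest`."""
    cmd = build_sync_command(source, dest, dry_run)
    print(shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

Calling `run_sync()` first and `run_sync(dry_run=False)` afterwards keeps the 3.3 TB off any local disk, since the copy goes from one bucket to the other.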

pascalnotin commented 9 months ago

@NZ99 @cmvcordova -- I believe this is now completed based on latest conversation with Niccolo. Could you please confirm?

cmvcordova commented 8 months ago

Confirming OpenProteinSet (OPS) is on the cluster and accessible through s3://openbioml/