evo-design / evo

Biological foundation modeling from molecular to genome scale
Apache License 2.0

Release of Pre-training Data Preprocess Scripts #41

Open KatarinaYuan opened 3 months ago

KatarinaYuan commented 3 months ago

Hi, if the data release for OpenGenome is still ongoing, would it be possible to release the preprocessing scripts for the data? They don't need to be exactly reproducible.

cx0 commented 1 week ago

@KatarinaYuan @brianhie

I have tried to reproduce the OpenGenome dataset here by following the instructions in the paper. You can use it to generate a functionally equivalent dataset for your own training while waiting for the authors to release the exact filtering steps used for the dataset.