HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

Next token prediction - head code location / config to pass #21

Closed HuangPZ closed 1 year ago

HuangPZ commented 1 year ago

Hello! Great work, and thanks for open-sourcing it! I'm trying to run pretraining and evaluate next-token prediction on a dataset we have.

The standalone model seems like the better starting point for loading pretrained weights and swapping in our own data loading. However, the standalone model doesn't include a head for next-token prediction. It's said to be in the main HyenaDNA code, but I'm having trouble finding where it lives and figuring out how to modify the standalone .py to use it. Could you point me to that part of the code, or explain how to modify the config file for this purpose?

Also, I'm not an expert in genomics, so I'm not familiar with the data structures if they're assumed to be well known. Since I don't have access to the main dataset you use, only some other single-chromosome genome sequences, could you describe the expected file structure so I can generate the correct files? (For the .fa file, I think each record consists of a header line followed by the sequence? And the .bed file holds start and end position information?)
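For concreteness, here is a minimal sketch of what I currently assume the two formats look like (the chromosome name, sequence, and coordinates below are made up; please correct me if the repo expects something different):

```python
# Minimal sketch of the two file formats as I understand them.
# FASTA (.fa): a ">" header line per record, then one or more sequence lines.
# BED (.bed): tab-separated "chrom  start  end", 0-based half-open intervals.

fasta_text = ">chr1 example single-chromosome record\nACGTACGTACGTNNACGT\n"
bed_text = "chr1\t0\t10\nchr1\t10\t18\n"

def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records, name, parts = {}, None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(parts)
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line.strip())
    if name is not None:
        records[name] = "".join(parts)
    return records

def parse_bed(text):
    """Return a list of (chrom, start, end) intervals from BED text."""
    rows = []
    for line in text.splitlines():
        chrom, start, end = line.split("\t")[:3]
        rows.append((chrom, int(start), int(end)))
    return rows

seqs = parse_fasta(fasta_text)
intervals = parse_bed(bed_text)
# Each BED interval slices a chunk out of the chromosome sequence.
chunks = [seqs[c][s:e] for c, s, e in intervals]
```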

Thanks!

exnx commented 1 year ago

Hello! Thanks!

Unfortunately, a lot of the things you asked about aren't supported out of the box (but the data is available and public!), and they require some custom manipulation of the code. Fortunately it's not especially difficult, but it does take some time investment to get familiar with the architecture and training code. We're certainly open to contributions! (welcome to research!)

Pretraining data info.

For pretraining you'll definitely want to use the main repo code. There are a lot of little extras in the main repo that make pretraining efficient and high quality.

The language head is used here, so you'll want to get familiar with it. The head is what gives you an actual prediction of the next token, rather than just an embedding.
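Conceptually, the head is just a linear projection from the model dimension to the vocabulary, applied on top of the backbone's embeddings. A minimal PyTorch sketch (the dimensions and names here are illustrative, not the repo's actual module or config):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 128, 12  # illustrative sizes, not HyenaDNA's real config

# Stand-in for the backbone output: one batch of 16 token embeddings.
embeddings = torch.randn(1, 16, d_model)

# The "language head": a linear map from embedding space to vocab logits.
lm_head = nn.Linear(d_model, vocab_size)

logits = lm_head(embeddings)              # (batch, seq_len, vocab_size)
next_token = logits[:, -1, :].argmax(-1)  # greedy choice for the next token
```

Without this projection, the model only returns `embeddings`; with it, you get per-position logits you can feed to a cross-entropy loss for next-token training.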

And something like this might help you get started with loading and passing in sequences, though again, it will need modification.
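As a rough illustration of passing in a sequence: DNA is tokenized at the character level, so feeding a sequence is essentially mapping each base to an integer id. The vocabulary below is a placeholder, not the repo's real tokenizer (which defines its own ids and special tokens):

```python
# Hypothetical character-level vocabulary; the real repo defines its own,
# including special tokens, so treat these ids as placeholders.
vocab = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def encode(seq):
    """Map a DNA string to a list of integer token ids."""
    return [vocab[base] for base in seq.upper()]

def decode(ids):
    """Inverse mapping, handy for inspecting model outputs."""
    inv = {i: b for b, i in vocab.items()}
    return "".join(inv[i] for i in ids)

ids = encode("ACGTn")  # lowercase input is uppercased before lookup
```

These ids are what you'd batch into a tensor and feed to the backbone plus language head.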

Good luck!