HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

Human Reference Genome questions #23

Closed kirill-vishniakov closed 8 months ago

kirill-vishniakov commented 8 months ago

Hello, thank you very much for open-sourcing your work. I have a couple of questions about the Human Genome Reference dataset.

[Q1] In your instructions for downloading the Human Reference Genome, you mention:

First step is to download the Human Reference Genome data. It's comprised of 2 files: 1 with all the sequences (the .fasta file), and 1 with the intervals we use (the .bed file). However, you'll need a GCP account to download the exact files we used (from Enformer), and it costs a little to download. At some point we'll try to upload the data somewhere to share it.

As far as I know, Enformer used a mix of the Basenji dataset and the Human Reference Genome. Specifically, their paper says:

We modified the Basenji2 dataset by extending the input sequence to 196,608bp from the original 131,072bp using the hg38 reference genome.

In the filenames for the Human Reference Genome you also reference Basenji, e.g.:

Download the fasta (.fa format) file (of the entire human genome) into hyena-dna/data/hg38. There are ~24 chromosomes in the whole genome (merged into 1 file); each chromosome is a continuous sequence. Basically: `gsutil -u hai-gcp-hippo cp gs://basenji_barnyard/hg38.ml.fa.gz ./ && gunzip hg38.ml.fa.gz`

The HyenaDNA paper does not reference Basenji, so I wonder how these two datasets relate to each other. Do you train on a mix of them, as Enformer does, or only on the Human Reference Genome? Also, in the Appendix you mention:

For pretraining, we use a single human reference genome (Genome Reference Consortium, 2013), and leverage the training and validation intervals (start and end) from (Avsec et al., 2021).

Does this imply that you use exactly the same data as Enformer, with the same train/val splits?

[Q2] Since accessing the data on GCP incurs a cost, you mentioned plans to make it more accessible. Do you have a timeline for this? Alternatively, is there a script to convert the original data into the Enformer format?

exnx commented 8 months ago

We follow the procedure Enformer uses to retrieve train/test intervals from Basenji on the reference genome. Yes, Basenji hosts the data (the intervals). The hg38 file itself is public and can be retrieved from many places.
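For readers unfamiliar with the setup, the interval-based approach described above boils down to slicing windows out of the reference genome at positions given by a BED file. Here is a minimal sketch (the parsing and the toy genome are illustrative assumptions, not the repo's actual dataloader):

```python
# Sketch: given a reference genome loaded as {chrom: sequence} and
# Enformer/Basenji-style BED intervals, extract the corresponding sequences.
# The toy genome below is made up for illustration.

def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals

def extract_sequences(genome, intervals):
    """Slice each interval out of the reference genome."""
    return [genome[chrom][start:end] for chrom, start, end in intervals]

# Toy example with a miniature "genome"
genome = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}
bed = ["chr1\t0\t4", "chr2\t4\t8"]
print(extract_sequences(genome, parse_bed(bed)))  # ['ACGT', 'GGGG']
```

In practice the genome dict would come from parsing the hg38 fasta (e.g. with a library like pyfaidx), and the BED lines from the intervals file mentioned in the repo instructions.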

Sorry, we don't host the data, but it should cost just cents, or at most a couple of dollars, to download with a GCP account. Using the predefined intervals (from Enformer/Basenji) is just something we chose to do; there's nothing wrong with sampling randomly from the hg38 fasta file, in which case you don't need the Enformer/Basenji intervals.
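The random-sampling alternative mentioned above could look roughly like this (a minimal sketch; the fasta parser, window length, and length-weighted chromosome choice are assumptions, not the repo's code):

```python
# Sketch: instead of using the Enformer/Basenji BED intervals, sample
# fixed-length windows uniformly at random from the reference fasta.
import random

def read_fasta(lines):
    """Parse fasta lines into {name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def sample_window(genome, length, rng=random):
    """Pick a chromosome (weighted by its length) and return a random window."""
    chroms = list(genome)
    weights = [len(genome[c]) for c in chroms]
    chrom = rng.choices(chroms, weights=weights, k=1)[0]
    start = rng.randrange(0, len(genome[chrom]) - length + 1)
    return genome[chrom][start:start + length]

# Toy example; a real run would read the hg38.ml.fa file instead.
genome = read_fasta([">chr1", "ACGTACGTACGT", ">chr2", "TTTTGGGG"])
print(sample_window(genome, length=4))  # a random 4-mer from chr1 or chr2
```

Weighting the chromosome choice by length keeps the sampling uniform over genome positions rather than uniform over chromosomes; for hg38-scale data you would also want to skip or mask runs of N bases, which this sketch omits.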