HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

Human Reference Genome questions #23

Closed kirill-vishniakov closed 8 months ago

kirill-vishniakov commented 8 months ago

Hello, thank you very much for open-sourcing your work. I have a couple of questions about the Human Genome Reference dataset.

[Q1] In your instructions for downloading the Human Reference Genome, you mention:

First step is to download the Human Reference Genome data. It's comprised of 2 files: 1 with all the sequences (the .fasta file), and 1 with the intervals we use (the .bed file). However, you'll need a GCP account to download the exact files we used (from Enformer), and it costs a little to download. At some point we'll try to upload the data somewhere to share it.

As far as I know, Enformer used a mix of the Basenji dataset and the Human Reference Genome. Specifically, their paper says:

We modified the Basenji2 dataset by extending the input sequence to 196,608bp from the original 131,072bp using the hg38 reference genome.

In the filenames for the Human Reference Genome you also reference Basenji, e.g.:

Download the fasta (.fa format) file (of the entire human genome) into hyena-dna/data/hg38. There are ~24 chromosomes in the whole genome (merged into 1 file); each chromosome is a continuous sequence. Basically: `gsutil -u hai-gcp-hippo cp gs://basenji_barnyard/hg38.ml.fa.gz ./ && gunzip hg38.ml.fa.gz`

The HyenaDNA paper does not reference Basenji, so I wonder how these two datasets relate to each other. Do you train on a mix of them, as Enformer does, or only on the Human Reference Genome? Also, in the Appendix you mention:

For pretraining, we use a single human reference genome (Genome Reference Consortium, 2013), and leverage the training and validation intervals (start and end) from (Avsec et al., 2021).

Does this imply that you use exactly the same data as Enformer, with the same train/val splits?

[Q2] Since accessing the data on GCP incurs a cost, you mentioned plans to make it more accessible. Do you have a timeline for this? Alternatively, is there a script to convert the original data into the Enformer format?

exnx commented 8 months ago

We follow the procedure Enformer uses to retrieve train/test intervals from Basenji on the reference genome. Yes, Basenji hosts the data (the intervals). The hg38 file itself is public and can be retrieved from many places.
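For readers unfamiliar with the setup, the interval-based approach described above boils down to slicing windows out of the reference genome at positions given by a BED file. Here is a minimal sketch (the parsing and the toy genome are illustrative assumptions, not the repo's actual dataloader):

```python
# Sketch: given a reference genome loaded as {chrom: sequence} and
# Enformer/Basenji-style BED intervals, extract the corresponding sequences.
# The toy genome below is made up for illustration.

def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals

def extract_sequences(genome, intervals):
    """Slice each interval out of the reference genome."""
    return [genome[chrom][start:end] for chrom, start, end in intervals]

# Toy example with a miniature "genome"
genome = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}
bed = ["chr1\t0\t4", "chr2\t4\t8"]
print(extract_sequences(genome, parse_bed(bed)))  # ['ACGT', 'GGGG']
```

In practice the genome dict would come from parsing the hg38 fasta (e.g. with a library like pyfaidx), and the BED lines from the intervals file mentioned in the repo instructions.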

Sorry, we don't host the data, but it should cost just cents, or at most a couple of dollars, to download with a GCP account. Using the predefined intervals (from Enformer/Basenji) is just something we chose to do; there's nothing wrong with sampling randomly from the hg38 fasta file, in which case you don't need the Enformer/Basenji intervals.
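The random-sampling alternative mentioned above could look roughly like this (a minimal sketch; the fasta parser, window length, and length-weighted chromosome choice are assumptions, not the repo's code):

```python
# Sketch: instead of using the Enformer/Basenji BED intervals, sample
# fixed-length windows uniformly at random from the reference fasta.
import random

def read_fasta(lines):
    """Parse fasta lines into {name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def sample_window(genome, length, rng=random):
    """Pick a chromosome (weighted by its length) and return a random window."""
    chroms = list(genome)
    weights = [len(genome[c]) for c in chroms]
    chrom = rng.choices(chroms, weights=weights, k=1)[0]
    start = rng.randrange(0, len(genome[chrom]) - length + 1)
    return genome[chrom][start:start + length]

# Toy example; a real run would read the hg38.ml.fa file instead.
genome = read_fasta([">chr1", "ACGTACGTACGT", ">chr2", "TTTTGGGG"])
print(sample_window(genome, length=4))  # a random 4-mer from chr1 or chr2
```

Weighting the chromosome choice by length keeps the sampling uniform over genome positions rather than uniform over chromosomes; for hg38-scale data you would also want to skip or mask runs of N bases, which this sketch omits.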