Closed kirill-vishniakov closed 8 months ago
We follow the same procedure as Enformer: the train/test intervals are retrieved from Basenji and applied to the reference genome. So yes, Basenji hosts the data (the intervals). The HG38 file itself is public and can be retrieved from many places.
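To illustrate what "intervals applied to the reference genome" means, here is a minimal sketch. It assumes the intervals come as tab-separated BED-style lines with a fourth column holding the split label; that layout, the toy data, and the helper names `load_intervals` and `fetch` are assumptions for illustration, not the repository's actual loader.

```python
from collections import defaultdict


def load_intervals(bed_lines):
    """Group (chrom, start, end) intervals by their split label.

    Assumes four tab-separated columns: chrom, start, end, split.
    """
    by_split = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, split = line.split("\t")[:4]
        by_split[split].append((chrom, int(start), int(end)))
    return dict(by_split)


def fetch(genome, interval):
    """Slice one interval out of an in-memory genome (dict: chrom -> sequence).

    BED coordinates are 0-based and half-open, which matches Python slicing.
    """
    chrom, start, end = interval
    return genome[chrom][start:end]


# Toy stand-ins (hypothetical data, not the real files).
toy_bed = [
    "chr1\t0\t4\ttrain",
    "chr2\t2\t6\ttrain",
    "chr1\t4\t8\tvalid",
    "chr2\t6\t10\ttest",
]
toy_genome = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}

intervals = load_intervals(toy_bed)
train_seqs = [fetch(toy_genome, iv) for iv in intervals["train"]]
```

With the real files, `toy_genome` would be replaced by the HG38 sequences and `toy_bed` by the lines of the downloaded interval file.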
Sorry, we don't host the data, but downloading it with a GCP account should cost only cents, or a couple of dollars at most. Using the predefined intervals (from Enformer/Basenji) is just something we chose to do. There's nothing wrong with sampling randomly from the HG38 fasta file instead, in which case you don't need the Enformer/Basenji intervals.
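For the random-sampling route, a rough sketch using only the standard library might look like the following. The tiny in-memory FASTA, the window length, and the function names `read_fasta` and `sample_window` are hypothetical. For the full HG38 file you would likely want an indexed reader rather than loading everything into memory.

```python
import io
import random


def read_fasta(handle):
    """Parse a FASTA stream into a dict: record name -> sequence string."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep only the ID before any whitespace
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def sample_window(genome, length, rng):
    """Draw one window of `length` bases uniformly over all valid start positions."""
    names = [n for n in genome if len(genome[n]) >= length]
    if not names:
        raise ValueError("no sequence is long enough for the requested window")
    # Weight each chromosome by how many valid start positions it has.
    weights = [len(genome[n]) - length + 1 for n in names]
    chrom = rng.choices(names, weights=weights, k=1)[0]
    start = rng.randrange(len(genome[chrom]) - length + 1)
    return chrom, start, genome[chrom][start:start + length]


# Toy FASTA standing in for the HG38 file (hypothetical data).
toy_fasta = ">chr1 toy\nACGTACGT\nACGT\n>chr2\nTTTTGGGG\n"
genome = read_fasta(io.StringIO(toy_fasta))

rng = random.Random(0)  # fixed seed so the draw is reproducible
chrom, start, seq = sample_window(genome, 6, rng)
```

Weighting by the number of valid start positions keeps longer chromosomes from being under-sampled. Whether to hold out whole chromosomes for validation is a separate choice that this sketch does not address.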
Hello, thank you very much for open-sourcing your work. I have a couple of questions about the Human Genome Reference dataset.
[Q1] In your instructions for downloading the Human Reference Genome you mention:
As far as I know, Enformer used a mix of the Basenji dataset and the Human Reference Genome. Specifically, their paper says:
The filenames for the Human Reference Genome also contain a reference to Basenji, i.e.
The Hyena-DNA paper does not contain any reference to Basenji, so I wonder how these two datasets relate to each other. Do you train on a mix of them, as Enformer does, or only on the Human Reference Genome? Also, in the Appendix you mention:
Does this imply that you use exactly the same data as Enformer, with the same train/val splits?
[Q2] Since accessing the data on GCP incurs a cost, you mentioned plans to make the data more accessible. Do you have a timeline for this? Alternatively, is there a script to convert the original data into the Enformer format?