Genome build versions are inconsistent in reference and chromatin profile (DeepSEA benchmark)

kohanlee1995 commented 10 months ago

I am attempting to replicate the validation process for the DeepSEA benchmark. The original DeepSEA dataset uses hg19 coordinates, while the reference genome is hg38. I've noticed that liftover is available in the source code, specifically within the ChromatinProfileDataset class. However, this liftover functionality seems to be accessible only when I use ChromatinProfileDataset directly.

Within the ChromatinProfile class, the arguments passed to ChromatinProfileDataset are:

ref_genome_version = self.ref_genome_version
coords_target_path = f'{self.data_path}/{split}_{self.ref_genome_version}_coords_targets.csv'

This code forces the genome version of the reference and dataset to be the same.
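To illustrate the coupling, here is a minimal standalone sketch of the path construction quoted above. The function name coords_target_path and the example directory are hypothetical, not part of the hyena-dna package; only the f-string pattern is taken from the source.

```python
def coords_target_path(data_path: str, split: str, ref_genome_version: str) -> str:
    """Build the coords/targets CSV path the same way the quoted f-string does.

    Hypothetical helper for illustration only; it is not the package's API.
    """
    # The reference genome version is baked into the dataset file name,
    # so the dataset file and the reference always share one version.
    return f"{data_path}/{split}_{ref_genome_version}_coords_targets.csv"


if __name__ == "__main__":
    # "data/deepsea" is an assumed directory name.
    for version in ("hg19", "hg38"):
        print(version, "->", coords_target_path("data/deepsea", "train", version))
```

With ref_genome_version set to 'hg38', the lookup targets a train_hg38_coords_targets.csv file, so loading fails unless a file with that name already exists.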

Would it be possible to add this flexibility to the package, or to provide an updated version of the DeepSEA benchmark that supports the hg38 reference genome?

exnx commented 10 months ago

We weren't planning to make these changes, but we certainly welcome any contributions to the repo!

cbirchsy commented 10 months ago

Hi @kohanlee1995,

You are right: some functionality in ChromatinProfileDataset is only available when using the dataset directly rather than through the dataloader.

However, if you use the ChromatinProfileDataset class directly just once with the setting save_liftover=True, the updated DeepSEA dataset files with hg38 coordinates will be created.

This creates the files {train,val,test}_hg38_coords_targets.csv, so the line

coords_target_path = f'{self.data_path}/{split}_{self.ref_genome_version}_coords_targets.csv'

in the dataloader will then work with ref_genome_version = 'hg38'.
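As a quick sanity check after that one-off run, something like the following sketch could confirm that all three files exist before switching the dataloader to hg38. The helper name missing_coords_files and the example directory are hypothetical; only the file naming pattern comes from the line above.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def missing_coords_files(data_path, ref_genome_version="hg38"):
    """Return the names of the per-split coords/targets CSVs not found on disk.

    Hypothetical helper for illustration; it does not call into hyena-dna.
    """
    expected = [
        Path(data_path) / f"{split}_{ref_genome_version}_coords_targets.csv"
        for split in SPLITS
    ]
    # Keep only the files that have not been created yet.
    return [path.name for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "data/deepsea" is an assumed directory name.
    missing = missing_coords_files("data/deepsea")
    print("missing files:", missing if missing else "none")
```

An empty list means the dataloader should find every split with ref_genome_version = 'hg38'.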

kohanlee1995 commented 10 months ago

Thank you @exnx and @cbirchsy for your response. That's what I did and it worked. Just wanted to make sure I am on the right path.

HazyResearch / hyena-dna

Genome build versions are inconsistent in reference and chromatin profile (DeepSEA benchmark) #35