frederikkemarin / BEND

Benchmarking DNA Language Models on Biologically Meaningful Tasks
BSD 3-Clause "New" or "Revised" License

Discrepancy in Genome Versions Used for Chromatin Accessibility Compared to Other ENCODE-Based Studies #55

yangzhao1230 closed this issue 6 months ago

yangzhao1230 commented 6 months ago

Why is the GRCh37 genome used for the chromatin accessibility task, while the histone modification and CpG methylation tasks, which also source their datasets from ENCODE, use GRCh38? What accounts for the different genome versions across these related tasks?

fteufel commented 6 months ago

Hi @yangzhao1230, we used GRCh37 for this task because it was faster for us to process the data this way. There's no further biological reason behind it.