frederikkemarin / BEND

Benchmarking DNA Language Models on Biologically Meaningful Tasks
BSD 3-Clause "New" or "Revised" License

dirty data exists in variant dataset #51

Closed yangzhao1230 closed 8 months ago

yangzhao1230 commented 11 months ago

Upon reviewing the genomic data, I found that some reference nucleotides in the dataset do not match the nucleotides at the same positions in the genome. For instance, in the expression dataset I found two samples whose corresponding reference nucleotides in the genome are both C. (Two screenshots attached.)
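A check of this kind can be sketched in a few lines. The following is only a minimal illustration, not the script used for this report: the `Variant` fields, the 1-based position convention, the `find_ref_mismatches` helper, and the toy sequence are all assumptions made for the example. With real data the genome would be loaded from a FASTA file rather than a hard-coded string.

```python
from typing import Dict, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical record layout; the real dataset columns may differ."""
    chrom: str
    pos: int  # 1-based position (assumed convention)
    ref: str  # reference allele as listed in the dataset


def find_ref_mismatches(
    genome: Dict[str, str], variants: List[Variant]
) -> List[Tuple[Variant, str]]:
    """Return (variant, observed) pairs where the listed reference
    allele differs from the genome sequence at that position."""
    mismatches = []
    for v in variants:
        start = v.pos - 1  # convert the 1-based position to a 0-based index
        observed = genome[v.chrom][start:start + len(v.ref)].upper()
        if observed != v.ref.upper():
            mismatches.append((v, observed))
    return mismatches


# Toy data only: a made-up 8-bp "chromosome" and three made-up variants.
toy_genome = {"chr1": "ACGTCCGA"}
toy_variants = [
    Variant("chr1", 1, "A"),  # agrees with the genome
    Variant("chr1", 5, "T"),  # genome has C here
    Variant("chr1", 6, "G"),  # genome has C here
]

mismatches = find_ref_mismatches(toy_genome, toy_variants)
for variant, observed in mismatches:
    print(f"{variant.chrom}:{variant.pos} dataset ref={variant.ref} genome={observed}")
```

Comparing in upper case avoids false mismatches in soft-masked (lower-case) regions of a genome FASTA.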

yangzhao1230 commented 11 months ago

The problem seems to be isolated to the expression dataset, and may have been introduced when the coordinates were converted from the GRCh37 (hg19) to the GRCh38 (hg38) genome assembly.

fteufel commented 10 months ago

Thanks @yangzhao1230 - I'll remove these samples from the dataset.

Will close the issue once the file online is replaced.

fteufel commented 8 months ago

The file online has been replaced.