frederikkemarin / BEND

Benchmarking DNA Language Models on Biologically Meaningful Tasks
BSD 3-Clause "New" or "Revised" License

Clarification multi_hot #65

Closed: tdsone closed this issue 2 months ago

tdsone commented 2 months ago

Hey Frederikke,

I may be missing something obvious, but I don't fully understand how the multi_hot function is used.

In embed_from_bed, multi_hot is used to encode the labels. For example, for the gene_finding task the labels for the first sample are [8, 8, ... ], i.e. one of nine classes for each nucleotide position. I would expect a prediction of the class at each nucleotide position, but multi_hot sums over all rows, which produces an array holding the number of occurrences of each class over the whole sequence. Is that understanding correct? If so, how is that useful for these tasks, whose labels are position-wise annotations that don't mean much when aggregated over a sequence?

Cheers!

tdsone commented 2 months ago

Or is it that most sections (as defined by chrom, strand, start, end) have the same label throughout anyway?

fteufel commented 2 months ago

It is a bit implicit: the labels for the position-wise tasks come as hdf5 files, so the multi-hot encoding is never called for them.

        if hdf5_file is not None: 
            labels = hdf5_file[n + start_offset]
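
To make the branch above concrete, here is a minimal sketch of the two label paths. It is not BEND's code: `multi_hot_sketch`, `select_labels`, and the toy `position_labels` list (a stand-in for the opened hdf5 dataset) are hypothetical names introduced only for illustration, and the real `multi_hot` may differ in detail.

```python
def multi_hot_sketch(class_ids, num_classes):
    """Hypothetical stand-in for multi_hot: one 0/1 flag per class for a whole region."""
    encoded = [0] * num_classes
    for class_id in class_ids:
        encoded[class_id] = 1
    return encoded


def select_labels(n, start_offset, class_ids, num_classes, position_labels=None):
    """Mirror the branch in the snippet above (assumed structure, not the actual function)."""
    if position_labels is not None:
        # Position-wise task: read the stored per-nucleotide labels as they are.
        return position_labels[n + start_offset]
    # Sequence-level task: collapse the region's class ids into one vector.
    return multi_hot_sketch(class_ids, num_classes)


# Toy stand-in for the hdf5 dataset: one row of per-nucleotide class ids per sample.
position_labels = [[8, 8, 8, 0, 1], [8, 8, 2, 2, 8]]

per_position = select_labels(0, 0, [], 9, position_labels)
sequence_level = select_labels(0, 0, [2, 5], 9)

print(per_position)    # [8, 8, 8, 0, 1]
print(sequence_level)  # [0, 0, 1, 0, 0, 1, 0, 0, 0]
```

When a label file is supplied, the per-nucleotide labels pass through untouched, so a task like gene_finding keeps one class per position; only the fallback path produces a single vector per region.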

tdsone commented 2 months ago

Ahhh, thanks!