Closed tdsone closed 2 months ago
Or is it that most sections (as defined by chrom, strand, start, end) always have the same label anyways?
It is a bit implicit... the labels for the position-wise tasks come as HDF5 files, so the multi-hot encoding is never called for them.
if hdf5_file is not None:
    labels = hdf5_file[n + start_offset]
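For anyone reading along, here is a rough sketch of how that branch plays out. This is not the repository's code: `get_labels`, `multi_hot_stand_in`, and the toy data are hypothetical, and a plain Python list stands in for the opened HDF5 file (anything indexable by integer behaves the same way here). The stand-in encoder assumes the summing behaviour described in the question; the real multi_hot may differ.

```python
import numpy as np


def multi_hot_stand_in(class_ids, num_classes):
    """Hypothetical stand-in for multi_hot: counts per class over the sequence."""
    return np.bincount(np.asarray(class_ids, dtype=np.int64), minlength=num_classes)


def get_labels(n, raw_label, num_classes, hdf5_file=None, start_offset=0):
    """Pick labels for sample n (hypothetical helper, not the library's API)."""
    if hdf5_file is not None:
        # Position-wise task: labels are read as stored, one class per position,
        # so the sequence-level encoder is never reached.
        return np.asarray(hdf5_file[n + start_offset])
    # Sequence-level task: collapse the raw class ids into one vector.
    return multi_hot_stand_in(raw_label, num_classes)


# A list of arrays plays the role of the HDF5 file in this toy example.
position_labels = [np.full(5, 8), np.array([0, 1, 1, 8, 8])]

per_position = get_labels(1, raw_label=None, num_classes=9, hdf5_file=position_labels)
sequence_level = get_labels(0, raw_label=[2, 5], num_classes=9)

print(per_position)    # one class id per nucleotide position
print(sequence_level)  # one vector for the whole sequence
```

With the HDF5 source present the per-position labels come back untouched, which matches the explanation above for tasks like gene_finding.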
Ahhh, thanks!
Hey Frederikke,
I'm probably being stupid about this but I don't understand the use of the multi_hot function properly.
In embed_from_bed you use multi_hot to encode the labels. E.g. for the task gene_finding, the labels for the first sample are [8, 8, ... ], i.e. one of nine classes for each nucleotide position. I would assume that you would predict a class for each nucleotide position, but instead multi_hot sums over all rows, which creates an array that stores the number of occurrences of each class over the whole sequence. Is that understanding correct? And how is that useful for tasks that are essentially per-position annotations, which don't mean much when aggregated over a sequence?
Cheers!
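To make the concern concrete, here is a small NumPy sketch of the two encodings being contrasted. It assumes multi_hot behaves as described above (one-hot rows summed over the sequence), which has not been checked against the actual implementation; the label values are made up for illustration.

```python
import numpy as np

num_classes = 9
# Toy per-position labels for a sequence of 8 nucleotides (illustrative only).
labels = np.array([8, 8, 8, 0, 1, 1, 8, 8])

# Per-position encoding: one one-hot row per nucleotide, shape (8, 9).
one_hot = np.eye(num_classes, dtype=np.int64)[labels]

# Summing over rows collapses the position axis, shape (9,).
# Each entry is how often that class occurs anywhere in the sequence.
summed = one_hot.sum(axis=0)

print(one_hot.shape)  # positions are preserved
print(summed)         # positions are lost, only class counts remain
```

The summed vector no longer says where each class occurs, which is why it would only make sense for sequence-level tasks and not for per-position annotation.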