Open pszmk opened 2 months ago
The lack of masking unnecessarily puts attention on pad tokens. Although their embedding is a zero vector, it is not common practice (it seems to me) to leave pads unmasked. One odd way of counteracting the softmax weight assigned to the zero cosine similarities would be to change the temperature, but I am fairly confident that masking pads is the usual way to go; it seems natural.
https://github.com/D4L-Pigeons/D4L-Hackaton/blob/c309d2f5d4455e930acd132d398ec808658522b1/src/models/components/condition_embedding.py#L272
The padding structure might be established with `batch[cond_ids_name]`, where `0` denotes PAD.
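
For reference, a minimal sketch of the usual fix under the assumptions above (id `0` is PAD, and `batch[cond_ids_name]` holds the condition ids): build a boolean key padding mask from the ids and pass it into attention. I use `nn.MultiheadAttention` here purely for illustration; the repo's attention module may take the mask differently.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumption from the issue: id 0 denotes PAD


def build_key_padding_mask(cond_ids: torch.Tensor) -> torch.Tensor:
    """True at positions attention should ignore (pad tokens).

    cond_ids: (batch, seq_len) LongTensor of condition ids.
    Returns a (batch, seq_len) bool tensor for `key_padding_mask`.
    """
    return cond_ids == PAD_ID


# usage sketch (shapes and module are illustrative, not the repo's)
batch_size, seq_len, embed_dim = 4, 10, 32
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

cond_ids = torch.randint(1, 100, (batch_size, seq_len))
cond_ids[:, 7:] = PAD_ID  # simulate right-padded sequences
x = torch.randn(batch_size, seq_len, embed_dim)

mask = build_key_padding_mask(cond_ids)
out, weights = attn(x, x, x, key_padding_mask=mask)
# attention weights over pad positions are now exactly 0,
# instead of the nonzero softmax weight they get from a 0 similarity
```

This avoids the temperature workaround entirely: masked positions are set to `-inf` before the softmax, so pads receive zero weight regardless of temperature.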