'N' exists in variant datasets

frederikkemarin / BEND

Benchmarking DNA Language Models on Biologically Meaningful Tasks

BSD 3-Clause "New" or "Revised" License

'N' exists in variant datasets #49

Closed: yangzhao1230 closed this issue 11 months ago

yangzhao1230 commented 11 months ago

I noticed that you pre-set embedding_idx in the variant tasks. However, these datasets contain 'N' characters, which may shift the correct embedding_idx, because each 'N' occupies a whole token.

yangzhao1230 commented 11 months ago

I only run into this issue when using the NT models, as they use a non-overlapping 6-mer tokenizer.
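To make the shift concrete, here is a minimal sketch. It assumes a toy greedy tokenizer that emits a 6-mer when the next six characters are all A/C/G/T and a single-character token otherwise. `toy_kmer_tokenize` and `token_index_of` are hypothetical helpers written for illustration; they are not the NT tokenizer or BEND code, and the real tokenizer may split sequences differently.

```python
def toy_kmer_tokenize(seq, k=6):
    """Greedy non-overlapping k-mer tokenizer (illustrative assumption only)."""
    tokens = []
    i = 0
    while i < len(seq):
        chunk = seq[i:i + k]
        if len(chunk) == k and set(chunk) <= set("ACGT"):
            tokens.append(chunk)      # a full k-mer of plain nucleotides
            i += k
        else:
            tokens.append(seq[i])     # 'N' (or a short tail) takes a whole token
            i += 1
    return tokens


def token_index_of(tokens, nucleotide_pos):
    """Return the index of the token that covers a 0-based nucleotide position."""
    covered = 0
    for idx, tok in enumerate(tokens):
        covered += len(tok)
        if nucleotide_pos < covered:
            return idx
    raise IndexError("position is outside the tokenized sequence")


clean = "ACGTAC" * 3                           # 18 nt, no 'N'
with_n = "ACGTAC" + "N" + "GTACGT" + "ACGTA"   # 18 nt, one 'N'

variant_pos = 13                 # nucleotide position of interest
naive_idx = variant_pos // 6     # index pre-computed assuming pure 6-mers

print(naive_idx)
print(token_index_of(toy_kmer_tokenize(clean), variant_pos))
print(token_index_of(toy_kmer_tokenize(with_n), variant_pos))
```

In this toy setup the pre-computed index matches the sequence without 'N' but points one token too early once an 'N' is present.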

fteufel commented 11 months ago

Thanks for reporting this! It should be fixable by using upsampling also for NT models. I'll investigate whether results are affected by this and fix it in a PR.
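One way to picture the upsampling idea is to repeat each token embedding once per nucleotide that the token covers, so that embeddings can be indexed in nucleotide coordinates regardless of how the sequence was split. The sketch below uses plain Python lists; `upsample_to_nucleotides` is a hypothetical helper, not the function BEND uses, and it assumes special tokens (such as a leading CLS token) have already been removed.

```python
def upsample_to_nucleotides(tokens, token_embeddings):
    """Repeat each token embedding len(token) times (illustrative assumption)."""
    if len(tokens) != len(token_embeddings):
        raise ValueError("need exactly one embedding per token")
    per_nucleotide = []
    for tok, emb in zip(tokens, token_embeddings):
        # A 6-mer contributes six copies; an 'N' token contributes one.
        per_nucleotide.extend([emb] * len(tok))
    return per_nucleotide


# Example tokenization of an 18 nt sequence containing one 'N' (made up for illustration).
tokens = ["ACGTAC", "N", "GTACGT", "A", "C", "G", "T", "A"]
# Dummy 1-dimensional embeddings: token i gets the vector [i].
token_embeddings = [[float(i)] for i in range(len(tokens))]

per_nt = upsample_to_nucleotides(tokens, token_embeddings)

variant_pos = 13             # index in nucleotide coordinates
print(len(per_nt))           # one embedding per nucleotide
print(per_nt[variant_pos])   # embedding of the token covering that nucleotide
```

With embeddings aligned to nucleotides, a pre-set index expressed in nucleotide coordinates no longer depends on how many 'N' tokens precede it.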