HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

Bug in the Hugging Face tokenizer #50

Open Horikitasaku opened 4 months ago

Horikitasaku commented 4 months ago

The tokenizer code in the Hugging Face version has a bug: it seems to be missing the [CLS] token in `build_inputs_with_special_tokens`.

The code on Hugging Face is:

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        # cls = [self.cls_token_id]
        result = token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

but I think it should be:

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

which matches the code in the GitHub repository.
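
To make the difference concrete, here is a minimal, self-contained sketch that compares the two variants of `build_inputs_with_special_tokens` outside the tokenizer class. The token ids (`CLS_ID`, `SEP_ID`, and the example sequence) are made up for illustration; the real values come from the tokenizer's vocabulary.

    from typing import List, Optional

    # Hypothetical ids for illustration only.
    CLS_ID, SEP_ID = 0, 1

    def hf_variant(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]:
        # Current Hugging Face code: the cls line is commented out, so no [CLS] is prepended.
        sep = [SEP_ID]
        result = token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

    def proposed_variant(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]:
        # Proposed fix: prepend [CLS], matching the GitHub version.
        sep = [SEP_ID]
        cls = [CLS_ID]
        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

    ids = [7, 8, 9, 7]  # a made-up encoded fragment
    print(hf_variant(ids))        # [7, 8, 9, 7, 1]     -> no [CLS]
    print(proposed_variant(ids))  # [0, 7, 8, 9, 7, 1]  -> [CLS] ... [SEP]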

Horikitasaku commented 4 months ago

I'm not sure why the author commented out the cls line, but I think it's a bug. Please let me know if I'm wrong.

My reasoning is that since the model can be fine-tuned on downstream classification tasks, the [CLS] token should be kept.
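
For context, here is a minimal sketch of the common pattern of pooling the hidden state at the [CLS] position (index 0 after the proposed fix) and feeding it to a classification head. This is a generic illustration with made-up names and shapes, not HyenaDNA's actual downstream head.

    import torch
    import torch.nn as nn

    class ClsPoolingHead(nn.Module):
        """Generic classification head that pools the hidden state at the
        [CLS] position, assumed here to be index 0."""
        def __init__(self, hidden_dim: int, num_classes: int):
            super().__init__()
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (batch, seq_len, hidden_dim) from the backbone
            cls_state = hidden_states[:, 0, :]  # hidden state at the [CLS] position
            return self.classifier(cls_state)

    # Toy usage with random stand-in backbone outputs
    hidden = torch.randn(2, 16, 128)               # (batch=2, seq_len=16, dim=128)
    head = ClsPoolingHead(hidden_dim=128, num_classes=2)
    logits = head(hidden)                          # shape: (2, 2)

Without a [CLS] token in the input, there is no dedicated position to pool from in this way.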