OpenRobotLab / PointLLM

[ECCV 2024] PointLLM: Empowering Large Language Models to Understand Point Clouds
https://runsenxu.com/projects/PointLLM

Clarification about special tokens <p_start> and <p_end> #6

Closed AndreAmaduzzi closed 8 months ago

AndreAmaduzzi commented 8 months ago

Hello! Thanks for your work! I find it really interesting. I have a question regarding the special tokens <p_start> and <p_end> you use in PointLLM. I would like to know what they look like: whether they are the BOS and EOS tokens of the LLM, whether they are trained, or whether they have a different structure...

Thanks in advance, Andrea

RunsenXu commented 8 months ago

They are newly added special tokens of the LLM's tokenizer, and their embeddings will be trained. You can refer to this line: https://github.com/OpenRobotLab/PointLLM/blob/e444053866e3d5d25b1aeff3652b2d3b9c12ceea/pointllm/model/pointllm.py#L325C25-L325C25
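
For reference, the general transformers pattern for this looks roughly like the minimal sketch below. Here "gpt2" is only a stand-in checkpoint, not the model PointLLM uses, and the exact PointLLM setup lives in the linked pointllm.py.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register <p_start> and <p_end> as new special tokens; they receive fresh
# vocabulary ids and are never split by the tokenizer.
num_added = tokenizer.add_tokens(["<p_start>", "<p_end>"], special_tokens=True)

# Grow the embedding matrix so the new ids get (trainable) embedding rows.
model.resize_token_embeddings(len(tokenizer))
```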

AndreAmaduzzi commented 8 months ago

I understand. I have an additional question: is the function add_tokens a built-in function of the PreTrainedTokenizer class? Can you point me to its source code? I cannot find it in your repo.

I am also wondering why you need such special tokens. I guess they work better than simply concatenating the point tokens with the question tokens... so I assume you tried plain concatenation, too.

Thanks, Andrea

RunsenXu commented 8 months ago

Yes, it's a built-in function. You should refer to the transformers source code.

    def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
        """
        Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to
        it with indices starting from length of the current vocabulary.

        Args:
            new_tokens (`List[str]` or `List[tokenizers.AddedToken]`):
                Token(s) to add in vocabulary. A token is only added if it's not already in the vocabulary (tested by
                checking if the tokenizer assigns the index of the `unk_token` to them).
            special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the tokens should be added as special tokens.

        Returns:
            `int`: The number of tokens actually added to the vocabulary.

        Examples:

        ```python
        # Let's see how to increase the vocabulary of Bert model and tokenizer
        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertModel.from_pretrained("bert-base-uncased")

        num_added_toks = tokenizer.add_tokens(["new_tok1", "my_new-tok2"])
        print("We have added", num_added_toks, "tokens")
        # Note: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
        model.resize_token_embeddings(len(tokenizer))
        ```"""
        new_tokens = [str(tok) for tok in new_tokens]

        tokens_to_add = []
        for token in new_tokens:
            if not isinstance(token, str):
                raise TypeError(f"Token {token} is not a string but a {type(token)}.")
            if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
                token = token.lower()
            if (
                token != self.unk_token
                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
                and token not in tokens_to_add
            ):
                tokens_to_add.append(token)
                if self.verbose:
                    logger.info(f"Adding {token} to the vocabulary")

        added_tok_encoder = {tok: len(self) + i for i, tok in enumerate(tokens_to_add)}
        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}
        self.added_tokens_encoder.update(added_tok_encoder)
        self.added_tokens_decoder.update(added_tok_decoder)

        # Make sure we don't split on any special tokens (even they were already in the vocab before e.g. for Albert)
        if special_tokens:
            if len(new_tokens) == 1:
                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, new_tokens[0])
            else:
                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(new_tokens)))
        else:
            # Or on the newly added tokens
            if len(tokens_to_add) == 1:
                _insert_one_token_to_ordered_list(self.unique_no_split_tokens, tokens_to_add[0])
            else:
                self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(tokens_to_add)))
        self._create_trie(self.unique_no_split_tokens)

        return len(tokens_to_add)
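
As the docstring and the uniqueness check above indicate, the function only adds tokens that are not already in the vocabulary and returns the number actually added, so repeated calls are harmless. A quick sketch of this behavior through the public add_tokens interface, using the token names from this issue and "gpt2" purely as a stand-in tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer, not the one PointLLM uses

first = tok.add_tokens(["<p_start>", "<p_end>"], special_tokens=True)
second = tok.add_tokens(["<p_start>", "<p_end>"], special_tokens=True)
print(first, second)  # 2 0 -- tokens already in the vocabulary are not added again

# Each special token gets its own id (not the unk id) and is never split.
print(tok.convert_tokens_to_ids("<p_start>"))
print(tok.tokenize("<p_start> a point cloud <p_end>"))
```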