agiresearch / OpenP5

OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems
Apache License 2.0

New tokens are not sorted, causing new token embeddings to mismatch when loading the model #4

Closed lzl65825 closed 1 year ago

lzl65825 commented 1 year ago

In main.py, line 190:

    if args.item_indexing == 'collaborative':
        for ds in train_loader.dataset.datasets:
            tokenizer.add_tokens(ds.new_token)

the new tokens are not sorted. This causes a random ordering of token IDs for the newly added tokens. Thus, when the model is loaded at line 200:

    if args.load:
        if local_rank == 0:
            logging.info(f"Load model from {args.model_path}")
        model = utils.load_model(model, args.model_path, args, loc=device)
        model.to(device)

the new token embeddings may not match those assigned during training.
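To illustrate the mismatch (a minimal sketch, not the repo's actual code): the toy `add_tokens` below mimics how a tokenizer appends unseen tokens with increasing IDs, and the two calls stand in for two runs that encounter the same `ds.new_token` entries in different orders.

```python
def add_tokens(vocab, tokens):
    # Toy stand-in for tokenizer.add_tokens: unseen tokens get the next free ID.
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

base = {"<pad>": 0, "<eos>": 1}

# Same set of new tokens, encountered in a different order on two runs.
train_vocab = add_tokens(dict(base), ["<item_7>", "<item_3>", "<item_12>"])
load_vocab  = add_tokens(dict(base), ["<item_12>", "<item_7>", "<item_3>"])

# The IDs disagree, so a checkpoint's embedding rows no longer line up
# with the tokens the reloaded tokenizer maps to those rows.
assert train_vocab["<item_7>"] != load_vocab["<item_7>"]
```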

lzl65825 commented 1 year ago

This probably does not affect the results in this repo, but users need to be very careful when relying on it: either keep the added tokens in exactly the same order, or sort the new tokens before adding them.
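The sorting workaround can be sketched as follows (again with a toy `add_tokens`, assuming `ds.new_token` is a list of token strings): sorting before the add makes ID assignment deterministic, so the saved embeddings stay aligned across runs.

```python
def add_tokens(vocab, tokens):
    # Toy stand-in for tokenizer.add_tokens (appends unseen tokens in order).
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

base = {"<pad>": 0, "<eos>": 1}

# Two runs see the same new tokens in different orders; sorting first
# yields identical token IDs on both runs.
run_a = add_tokens(dict(base), sorted(["<item_7>", "<item_3>", "<item_12>"]))
run_b = add_tokens(dict(base), sorted(["<item_12>", "<item_7>", "<item_3>"]))
assert run_a == run_b

# In main.py the corresponding one-line change would be:
#     tokenizer.add_tokens(sorted(ds.new_token))
```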