BBuf / RWKV-World-HF-Tokenizer


fixes eventually for HF repo and v6 #2

Closed SmerkyG closed 4 months ago

SmerkyG commented 4 months ago

I merged everything for both v5 and v6, and this should match Arthur's latest version quite closely and be mergeable there as well. But there is some unknown issue remaining with the merged v6 code. Somehow, my older rwkv_world_v6_model_batch code responds more correctly. I haven't been able to spot the difference that's causing that yet.

BBuf commented 4 months ago

> I merged everything for both v5 and v6, and this should match Arthur's latest version quite closely and be mergeable there as well. But there is some unknown issue remaining with the merged v6 code. Somehow, my older rwkv_world_v6_model_batch code responds more correctly. I haven't been able to spot the difference that's causing that yet.

Can the v6 model implementation in this PR chat normally? Additionally, is the script scripts/convert5to6.py used to convert the rwkv5.2 model to v6, and is that script unrelated to the v6 weights found at https://huggingface.co/BlinkDL/rwkv-6-world/tree/main?

SmerkyG commented 4 months ago

The code in rwkv_world_v6_model_batch seems to work normally, but the one in rwkv6_model doesn't seem quite right yet (though it seems close)... I haven't figured out why, despite trying to compare them line by line.

Yes, the convert5to6.py script can be used to upconvert a v5 model to v6, but it has nothing directly to do with Hugging Face and it's unnecessary here; I just needed it for other reasons and it ended up in my repo.

SmerkyG commented 4 months ago

Oh, and I made it so that in inference mode (training == False) it always calls the 'cpu' code path instead of the CUDA one, since state generation doesn't work properly with the CUDA kernel we're using here. Blink does have a newer CUDA revision that may fix that for both v5 and v6.
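
A minimal sketch of that dispatch, with hypothetical names and a toy computation (the real PR routes between the actual CUDA kernel and the pure-PyTorch fallback):

```python
# Hypothetical sketch of the dispatch described above; the function
# names and the toy computation are illustrative, not the PR's code.

def wkv_cpu(x):
    """Slower CPU path whose state generation is correct."""
    return [2 * v for v in x]  # stand-in for the real computation

def wkv_cuda(x):
    """Fast fused-kernel path; its state generation is unreliable here."""
    return [2 * v for v in x]

def wkv(x, training):
    # Training can afford the fast kernel; inference always takes the
    # CPU path so the recurrent state comes out right.
    return wkv_cuda(x) if training else wkv_cpu(x)

print(wkv([1, 2, 3], training=False))  # [2, 4, 6]
```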

BBuf commented 4 months ago

> Oh, and I made it so that in inference mode (training == False) it will always call the 'cpu' code instead of the cuda, since the cuda state generation doesn't work properly with the cuda we're using here. Blink does have a newer cuda revision that may work for that for both v5 and v6.

Okay, let's go ahead and use the non-CUDA kernel implementation here. I agree with Arthur's advice that we should prepare a separate CUDA package for RWKV to address performance issues. I will make the necessary modifications to this PR to ensure compatibility with the latest state and then proceed with the merge. Thank you.

SmerkyG commented 4 months ago

I also have performant non-CUDA code for v5 and v6, which is nearly as fast as the CUDA version under torch.compile. It has its own peculiarities, but it does fully support state generation and gradients. You can see it integrated into the infctx trainer most recently, if you're interested.

SmerkyG commented 4 months ago

Interesting findings on the 'bad output' in v6: the difference came down to the groupnorm dtype mismatching the inner calculations. If I keep both the output and the groupnorm in float32 it works well, and likewise if I keep the output and groupnorm in bfloat16 (but state and decay in float32). But when we mix and match, doing the inner calculations in float32 while using a bfloat16 groupnorm, it causes repetitive bad output. This is what the difference was, and it may be impacting v5 as well.

Note that this dtype changed when Arthur moved the groupnorm out of rwkv_linear_attention_v5_cpu, so lxw and lxb were no longer being cast to float32.
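
As an illustration of the fix (a sketch only, assuming PyTorch's functional group norm; `lxw`/`lxb` are the groupnorm weight and bias as named in the source), casting the bfloat16 parameters to float32 keeps them consistent with float32 inner calculations:

```python
import torch
import torch.nn.functional as F

# Sketch only, not the PR's exact code. Inner calculations in float32,
# groupnorm parameters stored in bfloat16 -- the failing mix-and-match.
out = torch.randn(1, 64, dtype=torch.float32)
lxw = torch.randn(64, dtype=torch.bfloat16)  # groupnorm weight
lxb = torch.randn(64, dtype=torch.bfloat16)  # groupnorm bias

# Casting lxw/lxb back to float32 (as the original v5 code did) keeps
# every dtype in the normalization consistent.
normed = F.group_norm(out, num_groups=8, weight=lxw.float(), bias=lxb.float())
print(normed.dtype)  # torch.float32
```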

SmerkyG commented 4 months ago

I made a commit that fixes this, so it's back to how it was in your original v5 version and works well for v6. As for what precision we should really be keeping things in, I'm not certain.

BBuf commented 4 months ago

> I made a commit that fixes this so it's back to how it was in your original v5 version and works well for v6. As for what precision we should really be keeping things in, I'm not certain.

Thank you. I've also provided some suggestions on the file structure; please address them, and then we can proceed with merging the PR.

SmerkyG commented 4 months ago

I got rid of those extra dirs per your request

BBuf commented 4 months ago

> I got rid of those extra dirs per your request

Good. Please fix the conflicts and we can merge!

SmerkyG commented 4 months ago

seems good now

BBuf commented 4 months ago

> seems good now

LGTM! merged.

SmerkyG commented 4 months ago

Excellent! I had a quick question about getting this to work for my purposes on HF; I sent you a DM on Discord, when you have a chance to look.

BBuf commented 4 months ago

> Excellent! I had a quick question about getting this to work for my purposes on HF, sent you a DM on discord when you have a chance to look.

replied.

SmerkyG commented 3 months ago

Not sure what change caused this, but I've been using the new tokenization_rwkv5.py for evals and it seems to be broken compared to the old tokenization_rwkv_world.py code. It works sometimes, but not others.


```
    tokens = self.tokenize(text, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 617, in tokenize
    tokenized_text.extend(self._tokenize(token))
                          ^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/lm-evaluation-harness/lm-evaluation-harness/modules/transformers_modules/SmerkyG/rwkv6-world-3b/b2788b7fef7d5c400c3b77e8d135acadfb97f388/tokenization_rwkv5.py", line 145, in _tokenize
    return self.wordpiece_tokenizer.tokenize(text.encode("utf-8"))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/lm-evaluation-harness/lm-evaluation-harness/modules/transformers_modules/SmerkyG/rwkv6-world-3b/b2788b7fef7d5c400c3b77e8d135acadfb97f388/tokenization_rwkv5.py", line 97, in tokenize
    sub_tokens.append(cur_substr.decode())
                      ^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
```
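
The traceback is consistent with a multi-byte UTF-8 character being split across substring boundaries, so that an incomplete byte sequence gets decoded on its own. A standalone illustration of that failure mode (not the tokenizer's actual code):

```python
# Decoding an incomplete UTF-8 byte sequence raises exactly this error.
data = "é".encode("utf-8")      # b'\xc3\xa9': two bytes for one codepoint
try:
    data[:1].decode()           # first byte alone, cut mid-codepoint
    reason = None
except UnicodeDecodeError as e:
    reason = e.reason
print(reason)  # unexpected end of data
```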
BBuf commented 3 months ago

> Not sure what change caused this, but I've been using the new tokenization_rwkv5.py for evals and it seems to be broken compared to the old tokenization_rwkv_world.py code. It works sometimes, but not others.

Can you provide the reproduction code?

BBuf commented 3 months ago

> Not sure what change caused this, but I've been using the new tokenization_rwkv5.py for evals and it seems to be broken compared to the old tokenization_rwkv_world.py code. It works sometimes, but not others.

or try https://huggingface.co/RWKV/rwkv-6-world-1b6

SmerkyG commented 3 months ago

Unfortunately it happened while running evals that I didn't write, as part of a suite via the lm eval harness. One that failed was boolq, though, if that helps.

BBuf commented 3 months ago

> Unfortunately it was part of running evals that I didn't write and are part of a suite via lm eval harness. One that failed was boolq though, if that helps.

I remember that the harness can print out the specific prompt that caused the error.

SmerkyG commented 3 months ago

I'm not running these manually at the moment and don't have time to investigate right now, but it happens on anli as well and is currently broken in the existing RWKV/rwkv-5-world-1b5 and RWKV/rwkv-5-world-3b HF repos.

Since it's broken, I would suggest reverting and creating a test suite to ensure that the rewritten tokenizer works exactly like the old one before pushing to production.

ArthurZucker commented 3 months ago

On it!