MeetKai / functionary

Chat language model that can use tools and interpret the results
MIT License

assert_packing_loss.py Invalid for deepseek-v2-lite #266

Open bao-xiaoyi opened 1 week ago

bao-xiaoyi commented 1 week ago

RuntimeError: CUDA error: an illegal memory access was encountered

Looking forward to an expert's answer.

khai-meetkai commented 1 week ago

Hi @bao-xiaoyi, can you send me the command you ran for assert_packing_loss.py?

bao-xiaoyi commented 1 week ago

python assert_packing_loss.py /kas/kas_workspace/open_llm/DeepSeek-Coder-V2-Lite-Instruct

bao-xiaoyi commented 1 week ago

Hi @bao-xiaoyi, can you send me the command you ran for assert_packing_loss.py?

Additionally, when I use StarCoder2 for testing, an error is also reported: assert ( original_token_count == mk_token_count ), f"number of tokens for computing loss is different: original_token_count = {original_token_count}, mk_token_count={mk_token_count}"

bao-xiaoyi commented 1 week ago

Hi @bao-xiaoyi, can you send me the command you ran for assert_packing_loss.py?

When I use StarCoder2, original_token_count = 147277 and mk_token_count = 4014.

khai-meetkai commented 6 days ago

Hi @bao-xiaoyi, I think the reason for this error is that this model uses remote code (i.e., it is using modeling_deepseek.py). So you can do as follows:

  • Directly copy all .py files from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct/tree/main and save them to a folder, for example remote_deepseek (inside the packing folder). Then replace the function _get_unpad_data with the monkey-patched code below (this is equivalent to the monkey-patch modeling_deepseek._get_unpad_data = get_unpad_data). You can also download the remote_deepseek.zip I attached in this post.

import torch
import torch.nn.functional as F


def get_max_seqlen_in_batch(attention_mask):
    # attention_mask: B x N, where each token is labelled with the 1-based index
    # of the packed data point it belongs to (0 = padding)
    max_num = torch.max(attention_mask)
    counts = []
    for i in range(1, max_num + 1):
        counts.append(
            torch.sum(attention_mask == i, axis=-1)
        )  # shape: (B,), number of tokens of the data point masked with i
    result = torch.stack(counts, axis=1)
    result = result.flatten()
    return result[result.nonzero()].squeeze(-1).to(dtype=torch.int32)


def _get_unpad_data(attention_mask):
    print("monkey-patched")
    seqlens_in_batch = get_max_seqlen_in_batch(
        attention_mask
    )  # replaces attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    cu_seqlens = F.pad(
        torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)
    )
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )
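
For intuition, here is a small toy example (an illustration of my own, not part of the attached files) of what these two functions return for a packed attention mask, assuming the definitions above are in scope:

import torch

# Toy packed batch: each token is labelled with the 1-based index of the packed
# data point it belongs to; 0 marks padding.
attention_mask = torch.tensor([
    [1, 1, 1, 2, 2, 0],  # row 0: data point 1 has 3 tokens, data point 2 has 2 tokens
    [1, 1, 2, 2, 2, 2],  # row 1: data point 1 has 2 tokens, data point 2 has 4 tokens
])

print(get_max_seqlen_in_batch(attention_mask))
# tensor([3, 2, 2, 4], dtype=torch.int32) -> one sequence length per packed data point

indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)
print(cu_seqlens)  # tensor([ 0,  3,  5,  7, 11], dtype=torch.int32)
print(max_seqlen)  # 4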

About assert_packing_loss.py, you can change it as follows:

  • when computing the loss of the original data, load the model using transformers.AutoModelForCausalLM
  • when computing the loss of the packed data, load the model using DeepseekV2ForCausalLM (from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM). You will see that the loss results are almost the same; the difference is only 0.0021%. You can also download the assert_packing_loss.py I provided in this post (a rough loading sketch follows below).

remote_deepseek.zip assert_packing_loss.py.zip
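
As a rough sketch of those two bullet points (this is not the attached assert_packing_loss.py; the model path, dtype, and attn_implementation below are assumptions of mine):

import torch
from transformers import AutoModelForCausalLM

model_path = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"  # assumed model path

# Loss on the original (unpacked) data: load through AutoModelForCausalLM with remote code.
original_model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Loss on the packed data: load through the locally copied (and patched) modeling code.
from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM

packed_model = DeepseekV2ForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)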

khai-meetkai commented 6 days ago

@bao-xiaoyi For StarCoder, which base model did you use? I tested the following command and it works:

python assert_packing_loss.py bigcode/starcoder2-7b

bao-xiaoyi commented 6 days ago

@bao-xiaoyi For StarCoder, which base model did you use? I tested the following command and it works:

python assert_packing_loss.py bigcode/starcoder2-7b

I chose the 15B model, and the average loss is a bit large.

bao-xiaoyi commented 6 days ago

Hi @bao-xiaoyi, I think the reason for this error is that this model uses remote code (i.e., it is using modeling_deepseek.py). So you can do as follows:

  • Directly copy all .py files from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct/tree/main and save them to a folder, for example remote_deepseek (inside the packing folder). Then replace the function _get_unpad_data with the following monkey-patched code (this is equivalent to the monkey-patch modeling_deepseek._get_unpad_data = get_unpad_data). You can also download the remote_deepseek.zip I attached in this post.
import torch
import torch.nn.functional as F


def get_max_seqlen_in_batch(attention_mask):
    # attention_mask: B x N, where each token is labelled with the 1-based index
    # of the packed data point it belongs to (0 = padding)
    max_num = torch.max(attention_mask)
    counts = []
    for i in range(1, max_num + 1):
        counts.append(
            torch.sum(attention_mask == i, axis=-1)
        )  # shape: (B,), number of tokens of the data point masked with i
    result = torch.stack(counts, axis=1)
    result = result.flatten()
    return result[result.nonzero()].squeeze(-1).to(dtype=torch.int32)


def _get_unpad_data(attention_mask):
    print("monkey-patched")
    seqlens_in_batch = get_max_seqlen_in_batch(
        attention_mask
    )  # replaces attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    cu_seqlens = F.pad(
        torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)
    )
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )

About assert_packing_loss.py, you can change it as follows:

  • when computing the loss of the original data, load the model using transformers.AutoModelForCausalLM
  • when computing the loss of the packed data, load the model using DeepseekV2ForCausalLM (from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM). You will see that the loss results are almost the same; the difference is only 0.0021%. You can also download the assert_packing_loss.py I provided in this post.

remote_deepseek.zip assert_packing_loss.py.zip

I don't quite understand why local code has to be used when packing, while remote code can be used when not packing. Why doesn't modeling_deepseek._get_unpad_data = get_unpad_data work?

bao-xiaoyi commented 6 days ago

Hi @bao-xiaoyi, I think the reason for this error is that this model uses remote code (i.e., it is using modeling_deepseek.py). So you can do as follows:

  • Directly copy all .py files from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct/tree/main and save them to a folder, for example remote_deepseek (inside the packing folder). Then replace the function _get_unpad_data with the following monkey-patched code (this is equivalent to the monkey-patch modeling_deepseek._get_unpad_data = get_unpad_data). You can also download the remote_deepseek.zip I attached in this post.
import torch
import torch.nn.functional as F


def get_max_seqlen_in_batch(attention_mask):
    # attention_mask: B x N, where each token is labelled with the 1-based index
    # of the packed data point it belongs to (0 = padding)
    max_num = torch.max(attention_mask)
    counts = []
    for i in range(1, max_num + 1):
        counts.append(
            torch.sum(attention_mask == i, axis=-1)
        )  # shape: (B,), number of tokens of the data point masked with i
    result = torch.stack(counts, axis=1)
    result = result.flatten()
    return result[result.nonzero()].squeeze(-1).to(dtype=torch.int32)


def _get_unpad_data(attention_mask):
    print("monkey-patched")
    seqlens_in_batch = get_max_seqlen_in_batch(
        attention_mask
    )  # replaces attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    cu_seqlens = F.pad(
        torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)
    )
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )

About assert_packing_loss.py, you can change it as follows:

  • when computing the loss of the original data, load the model using transformers.AutoModelForCausalLM
  • when computing the loss of the packed data, load the model using DeepseekV2ForCausalLM (from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM). You will see that the loss results are almost the same; the difference is only 0.0021%. You can also download the assert_packing_loss.py I provided in this post.

remote_deepseek.zip assert_packing_loss.py.zip

Moreover, the difference in time consumption does not seem as dramatic as shown in the README. I tested DeepSeek using the code you modified, and the time comparison is 18.712671 vs 7.400667, or 9.163215 vs 6.737796.

khai-meetkai commented 6 days ago

@bao-xiaoyi I think directly monkey-patching the remote code (trust_remote_code=True) doesn't work. To find out the reason, we would have to dig deeper into how transformers implements this feature; I haven't investigated it yet.
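
One possible explanation (a guess on my part, not something confirmed in this thread): with trust_remote_code=True, transformers copies the hub's modeling_deepseek.py into its dynamic-module cache and imports it under a different module name, so patching a separately imported modeling_deepseek never touches the copy the model actually uses. A hedged sketch of patching the module that is really in use, reached through the loaded model's class:

import sys

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", trust_remote_code=True
)

# type(model).__module__ points at the dynamically imported modeling_deepseek
# (something like transformers_modules....modeling_deepseek), so patch that one.
live_module = sys.modules[type(model).__module__]
live_module._get_unpad_data = _get_unpad_data  # the patched function defined earlier in this thread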

khai-meetkai commented 6 days ago

By the way, I have just run:

python original_assert.py bigcode/starcoder2-15b

No errors were found. The result: the difference between the losses is only 0.0011%.

bao-xiaoyi commented 6 days ago

@bao-xiaoyi I think directly monkey-patching the remote code (trust_remote_code=True) doesn't work. To find out the reason, we would have to dig deeper into how transformers implements this feature; I haven't investigated it yet.

Can you provide the time comparison results from your testing on DeepSeek? Thank you very much.

khai-meetkai commented 6 days ago

Running python assert_packing_loss.py deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct with the assert_packing_loss.py I sent you above:

time for computing the loss without packing: 9.336643
time for computing the loss with packing: 2.348312