About A = square(relu(qk / seq_len + bias))

ShomyLiu commented 2 years ago

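Hi, thank you very much for reproducing Flash-Quad in PyTorch. I'm also very interested in this model and have a couple of small questions I'd like to discuss:

1. In A = square(relu(qk / seq_len + bias)), wouldn't it be more appropriate for seq_len to be the length of the current batch? The code at https://github.com/JunnYu/FLASHQuad_pytorch/blob/main/flash/gau.py#L117 uses the preset max_length (e.g. 512), but different batches can have different sequence lengths.
2. Have you compared GAU with Transformer on different tasks? I tried several sequence modeling tasks and found that performance drops. Could that be due to differences in training hyperparameters?

Thanks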
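
To make the first question concrete, here is a minimal pure-Python sketch, not the repository's code. The raw scores and the batch length of 64 are made-up values for illustration; only the formula and the preset 512 come from the thread. Because the division sits inside the square, switching the divisor from 512 to 64 rescales every positive entry by (512 / 64) ** 2 = 64 when the bias is zero.

```python
def relu_squared_kernel(scores, n, bias=0.0):
    """Return square(relu(s / n + bias)) for each raw score s."""
    return [max(s / n + bias, 0.0) ** 2 for s in scores]


raw_scores = [8.0, -4.0, 16.0]  # hypothetical q.k values for one query
max_length = 512                # preset value discussed in the thread
batch_length = 64               # hypothetical length of the current batch

fixed = relu_squared_kernel(raw_scores, max_length)
dynamic = relu_squared_kernel(raw_scores, batch_length)

# With zero bias, the two kernels differ by a constant factor per entry.
ratio = dynamic[0] / fixed[0]
print(fixed)
print(dynamic)
print(ratio)
```

A model pretrained with one divisor therefore sees attention weights on a very different scale if the divisor is changed at inference time.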

JunnYu commented 2 years ago

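You can try changing it to the seqlen of the batch. I tried that earlier and found the model's outputs looked wrong, so I switched to max length. During pretraining, seqlen was almost always 512, which means the model has only ever seen that one length. On other tasks with short sentences, seqlen may be a few dozen or a hundred or so, lengths the model has never seen, and for reasons I don't understand the results aren't great.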

JunnYu commented 2 years ago

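Also, I found that my pretrained small weights don't perform very well. I don't know whether the original paper used dropout on the embeddings, or whether it used layernorm or scalenorm. My biggest doubt is exactly the part you raised: I'm not sure whether I implemented that detail of the model incorrectly.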

JunnYu commented 2 years ago

Output after switching to seqlen:

pytorch: 天气预报说今天的天[台+0.2037||的+0.0798||定+0.0446||好+0.0422||以+0.0386]很好，那么我[大+0.1093||的+0.0697||本+0.0629||以+0.0559||一+0.0518]一起去公园玩吧！

Output with max_length:

pytorch: 天气预报说今天的天[气+0.9948||空+0.0011||色+0.0007||候+0.0004||势+0.0003]很好，那么我[就+0.4915||们+0.4186||也+0.0753||还+0.0021||都+0.0016]一起去公园玩吧！

aoom commented 2 years ago

Thanks for sharing. Judging from these results, max_length works better!

ShomyLiu commented 2 years ago

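Thanks for the reply and for sharing the results. From these outputs, max does look much more reasonable, which is rather strange. I'll also test different settings and post the results when I have them.

On my side, I mainly use the GAU block to replace self-attention and the FFN for sequence modeling tasks, not MLM pretraining on a corpus, and so far there is basically no improvement. I also found that the learning rate has a large effect on model quality, with very noticeable fluctuations; that may also be due to the dataset.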

ShomyLiu commented 2 years ago

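Hi, one more small question: why does the paper apply RoPE inside the GAU unit? Aren't position vectors usually merged directly into the initial Embedding module? Is there a reason for this?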

JaheimLee commented 2 years ago

In that case, wouldn't computing qk / attention_mask.sum(-1)[:, None, None] directly be a more sensible approach?
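
As a rough illustration of this suggestion, the sketch below does the same per-sample scaling with plain Python lists instead of tensors. The helper names and the toy batch are hypothetical; the idea of dividing each sample's scores by the sum of its own attention mask is the only part taken from the comment.

```python
def per_sample_lengths(attention_mask):
    """Count the non-padding tokens in each row of a 0/1 attention mask."""
    return [sum(row) for row in attention_mask]


def scale_scores(qk, attention_mask):
    """Divide each sample's score matrix by that sample's own token count."""
    lengths = per_sample_lengths(attention_mask)
    return [
        [[s / n for s in row] for row in sample]
        for sample, n in zip(qk, lengths)
    ]


# Toy batch: 2 samples padded to length 3; the second has one padding token.
attention_mask = [[1, 1, 1], [1, 1, 0]]
qk = [[[6.0] * 3 for _ in range(3)] for _ in range(2)]

scaled = scale_scores(qk, attention_mask)
print(per_sample_lengths(attention_mask))
print(scaled[0][0], scaled[1][0])
```

With this scheme the divisor follows each sample's true length rather than the padded batch length or a preset maximum.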

ShomyLiu commented 2 years ago

I went back and read up on RoPE and think I roughly understand now: applying RoPE to q and k is what encodes relative position.
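
The relative-position property can be checked with a small sketch. It assumes a single 2-D rotation with one made-up frequency theta = 0.5, whereas real RoPE rotates many 2-D pairs at different frequencies, and the vectors q and k are arbitrary illustrative values.

```python
import math


def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by the angle position * theta (one RoPE frequency)."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def dot(a, b):
    """Plain 2-D dot product."""
    return a[0] * b[0] + a[1] * b[1]


q, k = (1.0, 2.0), (0.5, -1.0)

score_a = dot(rotate(q, 3), rotate(k, 1))    # positions 3 and 1, offset 2
score_b = dot(rotate(q, 10), rotate(k, 8))   # positions 10 and 8, offset 2
score_c = dot(rotate(q, 3), rotate(k, 2))    # positions 3 and 2, offset 1

print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error because only the offset between the two positions enters the dot product, while the third differs because its offset is different.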

JaheimLee commented 2 years ago

I looked at Su Jianlin's code: his l does indeed come from the mask, and it is applied outside the activation function. https://github.com/bojone/bert4keras/blob/8bf47989488009c2b8f68c20a97000fb96e07f9b/bert4keras/layers.py#L583

JunnYu commented 2 years ago

That is how the original paper implements it; Su Jianlin later changed it and moved where the scaling is applied.

ShomyLiu commented 2 years ago

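A quick update: on our sequence modeling tasks, Flash-Quad is consistently 1-2 points below Transformer. I've been tuning for a long time and can't get it any higher, and it also converges slowly.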

JunnYu commented 2 years ago

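I also feel the current implementation doesn't perform well, so we'll have to wait for the official code to be released to find out how some of the details are handled, such as the code for A = square(relu(qk / seq_len + bias)).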

ShomyLiu commented 2 years ago

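Yes. I've tested quite a few variants on my side, from strictly following the paper and its pseudocode at first to making my own modifications, and the final results still fall short of Transformer. Maybe GAU just isn't that general.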

JaheimLee commented 2 years ago

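Have you tried changing the affine transformation back to a fully connected layer? That operation has always seemed a bit odd to me.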

ShomyLiu commented 2 years ago

Not yet; I'll test it. Another odd thing is that the paper doesn't use dropout either.

JunnYu commented 2 years ago

I found that rel_pos_bias in the provided pseudocode also seems to have a problem. Below is the original implementation.

import torch

max_position_embeddings = 512
w = torch.arange(2 * max_position_embeddings - 1).float()
print(w.long())
# tensor([   0,    1,    2,  ..., 1020, 1021, 1022])

def rel_pos_bias(seq_len, w):
    # Construct the Toeplitz matrix directly when the sequence length is less than 512
    t = torch.nn.functional.pad(w[: 2 * seq_len - 1], [0, seq_len]).repeat(seq_len)
    t = t[..., :-seq_len].reshape(-1, seq_len, 3 * seq_len - 2)
    r = (2 * seq_len - 1) // 2
    t = t[..., r:-r]
    return t

#############
seqlen = 4
print(rel_pos_bias(seqlen, w))
# tensor([[[3., 4., 5., 6.],
#          [2., 3., 4., 5.],
#          [1., 2., 3., 4.],
#          [0., 1., 2., 3.]]])
#############
seqlen = 8
print(rel_pos_bias(seqlen, w))
# tensor([[[ 7.,  8.,  9., 10., 11., 12., 13., 14.],
#          [ 6.,  7.,  8.,  9., 10., 11., 12., 13.],
#          [ 5.,  6.,  7.,  8.,  9., 10., 11., 12.],
#          [ 4.,  5.,  6.,  7.,  8.,  9., 10., 11.],
#          [ 3.,  4.,  5.,  6.,  7.,  8.,  9., 10.],
#          [ 2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],
#          [ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.],
#          [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.]]])

That didn't look right to me, so I changed it to the form below, which I think is the correct one.

seqlen = 4
print(rel_pos_bias(512, w)[:, :seqlen, :seqlen])
# tensor([[[511., 512., 513., 514.],
#          [510., 511., 512., 513.],
#          [509., 510., 511., 512.],
#          [508., 509., 510., 511.]]])
seqlen = 8
print(rel_pos_bias(512, w)[:, :seqlen, :seqlen])
# tensor([[[511., 512., 513., 514., 515., 516., 517., 518.],
#          [510., 511., 512., 513., 514., 515., 516., 517.],
#          [509., 510., 511., 512., 513., 514., 515., 516.],
#          [508., 509., 510., 511., 512., 513., 514., 515.],
#          [507., 508., 509., 510., 511., 512., 513., 514.],
#          [506., 507., 508., 509., 510., 511., 512., 513.],
#          [505., 506., 507., 508., 509., 510., 511., 512.],
#          [504., 505., 506., 507., 508., 509., 510., 511.]]])
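
The difference between the two forms can be sketched in plain Python. The helper toeplitz_bias below is a hypothetical stand-in, not the repository's function: it indexes w directly by the offset j - i, which reproduces the values printed for the sliced 512-length version. Under this indexing a given offset always reads the same entry of w regardless of seq_len, whereas in the first form the diagonal (offset 0) reads w[3] for seq_len 4 and w[7] for seq_len 8.

```python
def toeplitz_bias(w, seq_len, max_len):
    """bias[i][j] = w[(j - i) + max_len - 1], so entries depend only on j - i."""
    return [
        [w[j - i + max_len - 1] for j in range(seq_len)]
        for i in range(seq_len)
    ]


max_len = 512
w = list(range(2 * max_len - 1))  # stand-in for the learned bias vector

b4 = toeplitz_bias(w, 4, max_len)
b8 = toeplitz_bias(w, 8, max_len)

for row in b4:
    print(row)

# The top-left 4x4 corner of the length-8 matrix equals the length-4 matrix.
print([row[:4] for row in b8[:4]] == b4)
```

Sharing one parameter per offset across lengths is what lets a bias learned at length 512 carry over to shorter inputs.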

ShomyLiu commented 2 years ago

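Hi, I was tied up with other things for the past few days and have recently picked GAU back up. I noticed that the placement of attention_mask in your reproduction doesn't look quite right, though I'm not sure whether this is what causes GAU's poor performance: https://github.com/JunnYu/FLASHQuad_pytorch/blob/main/flash/gau.py#L116-L124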

kernel = torch.square(torch.relu(
    qk / self.max_position_embeddings + bias))
# attention_mask
if attention_mask is not None:
    assert attention_mask.ndim == 2
    attn_mask = (
        attention_mask[:, None, :] * attention_mask[:, :, None]
    ).type_as(x)
    kernel *= attn_mask

The mask should be applied first, and the normalization relu(qk)**2 computed afterwards.

I'm also running some performance tests.

JunnYu commented 2 years ago

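I don't think it makes much difference. If you mask first, some of the qk values become 0, and applying relu and square to 0 still gives 0; the main difference is the extra bias term. Also, when I mask, I mask both the rows and the columns at the padding positions of the matrix. And isn't the mask normally applied to the attention scores anyway?

I'm currently training the small model with the code from https://github.com/lucidrains/FLASH-pytorch. Training is nearly finished, and you can try it afterwards: https://wandb.ai/junyu/huggingface/runs/1jg2jlgt?workspace=user-junyu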

ShomyLiu commented 2 years ago

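Normally the mask should be applied before the attention scores are normalized, so that the subsequent normalization, such as softmax, is meaningful and the scores over the valid positions sum to 1. If you apply softmax first and then mask, the scores over the valid positions no longer sum to 1. The only difference here is that the normalization function has become relu**2.
Thanks a lot for providing the pretrained weights. I'll also test lucidrains' FLASH reproduction and compare the results.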
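
The softmax point can be illustrated with a short sketch. The scores and mask are made-up values, and the large negative fill value is one common way of masking before softmax, assumed here for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]  # hypothetical attention scores for one query
mask = [1, 1, 1, 0]            # the last position is padding

# Mask first: padded scores are pushed to a very negative value.
masked_scores = [s if m else -1e9 for s, m in zip(scores, mask)]
mask_then_norm = softmax(masked_scores)

# Normalize first: padded probabilities are zeroed afterwards.
norm_then_mask = [p * m for p, m in zip(softmax(scores), mask)]

print(sum(mask_then_norm))
print(sum(norm_then_mask))
```

Masking first keeps the valid positions summing to 1, while masking afterwards leaves them summing to less than 1. Since relu**2 has no sum-to-one step, zeroing the padded entries afterwards does not rescale the remaining ones in the same way.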

JunnYu commented 2 years ago

I'll upload a 190k-step checkpoint first for you to try.

JunnYu commented 2 years ago

import torch
from flash_pytorch import FLASHTransformer
from transformers import BertTokenizerFast
model = FLASHTransformer(
    num_tokens=12000,          # number of tokens
    dim=768,                   # model dimension
    depth=12,                  # depth
    causal=False,              # autoregressive or not
    group_size=256,            # size of the groups
    query_key_dim=128,         # dimension of queries / keys
    expansion_factor=2.,       # hidden dimension = dim * expansion_factor
    # in the paper, they claimed scalenorm led to faster training at no performance hit. the other option is 'layernorm' (also default)
    norm_type='scalenorm',
    shift_tokens=False
)
tokenizer = BertTokenizerFast.from_pretrained("junnyu/roformer_chinese_char_base")
model.load_state_dict(torch.load("flash.pt", map_location="cpu"))
model.eval()
text = "中国的首都是[MASK]京。"
inputs = tokenizer(text, return_tensors="pt", padding="max_length", max_length=512)  # This must be 512; otherwise the results are wrong.

with torch.no_grad():
    pt_outputs = model(inputs["input_ids"])[0]

pt_outputs_sentence = "pytorch: "
for i, id in enumerate(tokenizer.encode(text)):
    if id == tokenizer.mask_token_id:
        val,idx = pt_outputs[i].softmax(-1).topk(k=5)
        tokens = tokenizer.convert_ids_to_tokens(idx)
        new_tokens = []
        for v,t in zip(val.cpu(),tokens):
            new_tokens.append(f"{t}+{round(v.item(),4)}")
        pt_outputs_sentence += "[" + "||".join(new_tokens) + "]"
    else:
        pt_outputs_sentence += "".join(
            tokenizer.convert_ids_to_tokens([id], skip_special_tokens=True))
print(pt_outputs_sentence)
# pytorch: 中国的首都是[北+0.8221||南+0.0787||东+0.0559||西+0.0055||中+0.0033]京。

ShomyLiu commented 2 years ago

Nice, I'll test it too, and then try it on my other sequence task to see whether there's any improvement.

JunnYu commented 2 years ago

Weights: https://wss1.cn/f/7z0orce18tp (copy the link into your browser to open it)

JunnYu commented 2 years ago

small version + 250k training steps + batch_size 128 + lr 1e-4 + linear learning-rate decay + max_length 512
Final MLM accuracy on the training set: about 51%
Weights now available: https://huggingface.co/junnyu/flash_small_wwm_cluecorpussmall
Full training log: https://wandb.ai/junyu/huggingface/runs/1jg2jlgt

ShomyLiu commented 2 years ago

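Thanks! I briefly tested these weights, and the Flash model seems to do quite well on language tasks. But on my other sequence modeling task, which involves no pretraining, I just can't get the performance up; it's very strange, and it trails Transformer by a fair margin. It really is fast, though.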

JunnYu commented 2 years ago

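The new weights will probably work a bit better with inputs padded to the maximum length of 512. I'm not sure whether you padded to the maximum length in your experiments.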

ShomyLiu commented 2 years ago

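Do you mean padding every sequence to 512? I haven't tested that yet. On my side the maximum length is set to 512, but at tokenization time the longest sequence in the batch determines that batch's seq_len.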

JunnYu commented 2 years ago

tokenizer(text, return_tensors="pt", max_length=512, padding="max_length")

JunnYu commented 2 years ago

When I tested the new code, I found that predictions on short texts aren't very good unless they are padded to 512.

ShomyLiu commented 2 years ago

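Right, I've just switched to that; I was using padding='longest' before. After padding to 512, the results on the FLASH model are finally somewhat normal and come reasonably close to Transformer. Previously, with dynamic batch lengths (mostly around one or two hundred), the results were abnormally bad.
GAU still has some problems, though. I'll keep looking for the cause; it may also be an issue with how seqlen is handled.
Both my earlier GAU reproduction and the new version basically follow the paper, and both do poorly on short texts. This seems odd: for very long sequences, padding everything to 512, 1024, 4096 and so on wastes far too much GPU memory.

Please get some rest.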

JunnYu commented 2 years ago

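I'm not sure whether we should be dividing by n here. I suspect that dividing by g, i.e. 256, or by a fixed 512 might work better, and then there would be no need to pad to length 512. https://github.com/JunnYu/FLASHQuad_pytorch/blob/e5902617f4573c9edd967313eba8f01234b5cebf/flash/flash_lucidrains.py#L274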

JunnYu commented 2 years ago

These are my results on the dev set of the cluener dataset with GlobalPointer.

# flash n = 512
ADDRESS = Score(f1=0.589928, precision=0.636646, recall=0.549598, tp=205, pred=322, gold=373)
BOOK = Score(f1=0.767025, precision=0.856, recall=0.694805, tp=107, pred=125, gold=154)
COMPANY = Score(f1=0.80677, precision=0.864048, recall=0.756614, tp=286, pred=331, gold=378)
GAME = Score(f1=0.838063, precision=0.825658, recall=0.850847, tp=251, pred=304, gold=295)
GOVERNMENT = Score(f1=0.794239, precision=0.807531, recall=0.781377, tp=193, pred=239, gold=247)
MOVIE = Score(f1=0.798561, precision=0.874016, recall=0.735099, tp=111, pred=127, gold=151)
NAME = Score(f1=0.859935, precision=0.868421, recall=0.851613, tp=396, pred=456, gold=465)
ORGANIZATION = Score(f1=0.812317, precision=0.879365, recall=0.754768, tp=277, pred=315, gold=367)
POSITION = Score(f1=0.771499, precision=0.824147, recall=0.725173, tp=314, pred=381, gold=433)
SCENE = Score(f1=0.655914, precision=0.748466, recall=0.583732, tp=122, pred=163, gold=209)
micro_f1 = Score(f1=0.775321, precision=0.818675, recall=0.736328, tp=2262, pred=2763, gold=3072)
macro_f1 = Score(f1=0.769425, precision=0.81843, recall=0.728363, tp=0, pred=0, gold=0)
mean_f1 = 0.772373

# flash n = seqlen
ADDRESS = Score(f1=0.587393, precision=0.630769, recall=0.549598, tp=205, pred=325, gold=373)
BOOK = Score(f1=0.765957, precision=0.84375, recall=0.701299, tp=108, pred=128, gold=154)
COMPANY = Score(f1=0.788051, precision=0.852308, recall=0.732804, tp=277, pred=325, gold=378)
GAME = Score(f1=0.83, precision=0.816393, recall=0.844068, tp=249, pred=305, gold=295)
GOVERNMENT = Score(f1=0.785863, precision=0.807692, recall=0.765182, tp=189, pred=234, gold=247)
MOVIE = Score(f1=0.786441, precision=0.805556, recall=0.768212, tp=116, pred=144, gold=151)
NAME = Score(f1=0.851852, precision=0.863135, recall=0.84086, tp=391, pred=453, gold=465)
ORGANIZATION = Score(f1=0.803371, precision=0.828986, recall=0.779292, tp=286, pred=345, gold=367)
POSITION = Score(f1=0.780952, precision=0.805897, recall=0.757506, tp=328, pred=407, gold=433)
SCENE = Score(f1=0.651685, precision=0.789116, recall=0.555024, tp=116, pred=147, gold=209)
micro_f1 = Score(f1=0.769754, precision=0.80519, recall=0.737305, tp=2265, pred=2813, gold=3072)
macro_f1 = Score(f1=0.763156, precision=0.80436, recall=0.729385, tp=0, pred=0, gold=0)
mean_f1 = 0.766455

# weights: uer/chinese_roberta_L-6_H-768
ADDRESS = Score(f1=0.65445, precision=0.639386, recall=0.670241, tp=250, pred=391, gold=373)
BOOK = Score(f1=0.787097, precision=0.782051, recall=0.792208, tp=122, pred=156, gold=154)
COMPANY = Score(f1=0.808729, precision=0.785536, recall=0.833333, tp=315, pred=401, gold=378)
GAME = Score(f1=0.805031, precision=0.750733, recall=0.867797, tp=256, pred=341, gold=295)
GOVERNMENT = Score(f1=0.825336, precision=0.784672, recall=0.870445, tp=215, pred=274, gold=247)
MOVIE = Score(f1=0.806667, precision=0.812081, recall=0.801325, tp=121, pred=149, gold=151)
NAME = Score(f1=0.853306, precision=0.819444, recall=0.890086, tp=413, pred=504, gold=464)
ORGANIZATION = Score(f1=0.801609, precision=0.788918, recall=0.814714, tp=299, pred=379, gold=367)
POSITION = Score(f1=0.798216, precision=0.771552, recall=0.82679, tp=358, pred=464, gold=433)
SCENE = Score(f1=0.660465, precision=0.642534, recall=0.679426, tp=142, pred=221, gold=209)
micro_f1 = Score(f1=0.784443, precision=0.759451, recall=0.811136, tp=2491, pred=3280, gold=3071)
macro_f1 = Score(f1=0.780091, precision=0.757691, recall=0.804636, tp=0, pred=0, gold=0)
mean_f1 = 0.782267

# cluener dataset, softmax
# flash with n set to 512
ADDRESS = Score(f1=0.520516, precision=0.4625, recall=0.595174, tp=222, pred=480, gold=373)
BOOK = Score(f1=0.683544, precision=0.666667, recall=0.701299, tp=108, pred=162, gold=154)
COMPANY = Score(f1=0.698337, precision=0.633621, recall=0.777778, tp=294, pred=464, gold=378)
GAME = Score(f1=0.809061, precision=0.773994, recall=0.847458, tp=250, pred=323, gold=295)
GOVERNMENT = Score(f1=0.730337, precision=0.679443, recall=0.789474, tp=195, pred=287, gold=247)
MOVIE = Score(f1=0.742857, precision=0.713415, recall=0.774834, tp=117, pred=164, gold=151)
NAME = Score(f1=0.815047, precision=0.792683, recall=0.83871, tp=390, pred=492, gold=465)
ORGANIZATION = Score(f1=0.685366, precision=0.620309, recall=0.765668, tp=281, pred=453, gold=367)
POSITION = Score(f1=0.727072, precision=0.697034, recall=0.759815, tp=329, pred=472, gold=433)
SCENE = Score(f1=0.584541, precision=0.590244, recall=0.578947, tp=121, pred=205, gold=209)
micro_f1 = Score(f1=0.701856, precision=0.658766, recall=0.750977, tp=2307, pred=3502, gold=3072)
macro_f1 = Score(f1=0.699668, precision=0.662991, recall=0.742916, tp=0, pred=0, gold=0)
mean_f1 = 0.700762

# flash with n kept as seqlen
ADDRESS = Score(f1=0.52581, precision=0.476087, recall=0.587131, tp=219, pred=460, gold=373)
BOOK = Score(f1=0.681529, precision=0.66875, recall=0.694805, tp=107, pred=160, gold=154)
COMPANY = Score(f1=0.665874, precision=0.604752, recall=0.740741, tp=280, pred=463, gold=378)
GAME = Score(f1=0.785256, precision=0.744681, recall=0.830508, tp=245, pred=329, gold=295)
GOVERNMENT = Score(f1=0.741996, precision=0.693662, recall=0.797571, tp=197, pred=284, gold=247)
MOVIE = Score(f1=0.721003, precision=0.684524, recall=0.761589, tp=115, pred=168, gold=151)
NAME = Score(f1=0.794148, precision=0.772358, recall=0.817204, tp=380, pred=492, gold=465)
ORGANIZATION = Score(f1=0.683609, precision=0.640476, recall=0.73297, tp=269, pred=420, gold=367)
POSITION = Score(f1=0.734967, precision=0.709677, recall=0.762125, tp=330, pred=465, gold=433)
SCENE = Score(f1=0.54382, precision=0.512712, recall=0.578947, tp=121, pred=236, gold=209)
micro_f1 = Score(f1=0.691098, precision=0.650848, recall=0.736654, tp=2263, pred=3477, gold=3072)
macro_f1 = Score(f1=0.687801, precision=0.650768, recall=0.730359, tp=0, pred=0, gold=0)
mean_f1 = 0.68945

# weights: uer/chinese_roberta_L-6_H-768
ADDRESS = Score(f1=0.570302, precision=0.559278, recall=0.581769, tp=217, pred=388, gold=373)
BOOK = Score(f1=0.765823, precision=0.746914, recall=0.785714, tp=121, pred=162, gold=154)
COMPANY = Score(f1=0.745592, precision=0.711538, recall=0.783069, tp=296, pred=416, gold=378)
GAME = Score(f1=0.778295, precision=0.717143, recall=0.850847, tp=251, pred=350, gold=295)
GOVERNMENT = Score(f1=0.763158, precision=0.712281, recall=0.821862, tp=203, pred=285, gold=247)
MOVIE = Score(f1=0.740506, precision=0.709091, recall=0.774834, tp=117, pred=165, gold=151)
NAME = Score(f1=0.829069, precision=0.789474, recall=0.872845, tp=405, pred=513, gold=464)
ORGANIZATION = Score(f1=0.723785, precision=0.681928, recall=0.771117, tp=283, pred=415, gold=367)
POSITION = Score(f1=0.780172, precision=0.731313, recall=0.836028, tp=362, pred=495, gold=433)
SCENE = Score(f1=0.652381, precision=0.649289, recall=0.655502, tp=137, pred=211, gold=209)
micro_f1 = Score(f1=0.739298, precision=0.703529, recall=0.778899, tp=2392, pred=3400, gold=3071)
macro_f1 = Score(f1=0.734908, precision=0.700825, recall=0.773359, tp=0, pred=0, gold=0)
mean_f1 = 0.737103
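
For readers checking the tables, the Score fields are related by the usual definitions: precision = tp / pred, recall = tp / gold, and f1 is their harmonic mean. The helper below is a hypothetical re-implementation for illustration, not the evaluation script that produced these logs. It is applied to the micro_f1 counts from the first "flash n = 512" block.

```python
def score(tp, pred, gold):
    """Return (f1, precision, recall) from true-positive, predicted and gold counts."""
    precision = tp / pred
    recall = tp / gold
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall


# Counts copied from the first micro_f1 line (flash n = 512, GlobalPointer).
f1, precision, recall = score(tp=2262, pred=2763, gold=3072)
print(round(f1, 6), round(precision, 6), round(recall, 6))
```

The rounded values agree with the reported f1=0.775321, precision=0.818675 and recall=0.736328.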

ShomyLiu commented 2 years ago

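Looking at the results, the Flash model seems to underperform on recall across the board. My sequence task is also a classification task, but recall matters more there, and Flash's recall is somewhat lower than Transformer/BERT's. It also converges much more slowly than Transformer: Transformer is basically good enough after about 8k steps, while Flash needs around 20k.
Also, on your side there's no clear winner between seqlen and 512, whereas on mine using seqlen is considerably worse. After a lot of tuning, my overall impression is that Flash may really not match Transformer, in either performance or generality. Its only advantage may be being linear.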

One-sixth commented 1 year ago

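@JunnYu @ShomyLiu I ran into the same problem as you and also spent some time on it, so let me share what I found. I use the model for autoregressive translation (text generation). The loss decreased normally, but the generated output was terrible, basically gibberish. Debugging showed that the causal mask had stopped working: feeding in the token at position n changed the probabilities of all output tokens at positions n-1 and earlier. The cause was that seq_len. After I changed it to a fixed value, q.shape[-1], the causal mask worked again and autoregressive performance improved a great deal, though it is still slightly worse than an ordinary multi-head transformer.

Note: I don't use RoPE position encoding (I found it slows training considerably and also converges a bit more slowly); I only use T5's additive relative position encoding.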
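
The causal-mask failure described above can be reproduced in a few lines of plain Python. The score matrix and the fixed divisor 4 are made-up values for illustration (the comment uses q.shape[-1] as the fixed value), and the helper is a hypothetical simplification without the bias term.

```python
def causal_relu2_weights(scores, n):
    """Lower-triangular square(relu(s / n)) weights for a square score matrix."""
    size = len(scores)
    return [
        [max(scores[i][j] / n, 0.0) ** 2 if j <= i else 0.0 for j in range(size)]
        for i in range(size)
    ]


full = [[4.0, 1.0, 2.0],
        [2.0, 6.0, 1.0],
        [1.0, 3.0, 8.0]]
prefix = [row[:2] for row in full[:2]]  # the same sequence without its last token

# Divisor tied to the current length: appending a token changes earlier rows.
w_prefix_dyn = causal_relu2_weights(prefix, n=2)
w_full_dyn = causal_relu2_weights(full, n=3)

# Fixed divisor: earlier rows are unaffected by the appended token.
w_prefix_fix = causal_relu2_weights(prefix, n=4)
w_full_fix = causal_relu2_weights(full, n=4)

print(w_prefix_dyn[0][0], w_full_dyn[0][0])
print(w_prefix_fix[0][0], w_full_fix[0][0])
```

Even with a correct triangular mask, a length-dependent divisor lets later tokens influence earlier positions through the scale, which is consistent with the behavior reported here.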

JunnYu / FLASHQuad_pytorch

About A = square(relu(qk / seq_len + bias)) #1