Question about random sampling and adjacent sampling in Section 6.3.3

Bug description

import random

import torch

# This function is saved in the d2lzh_pytorch package for later use
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because each label index (Y) is the corresponding input index (X) plus 1
    num_examples = (len(corpus_indices) - 1) // num_steps
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)

    # Return the sequence of length num_steps starting at pos
    def _data(pos):
        return corpus_indices[pos: pos + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(epoch_size):
        # Read batch_size random samples each time
        i = i * batch_size
        batch_indices = example_indices[i: i + batch_size]
        X = [_data(j * num_steps) for j in batch_indices]
        Y = [_data(j * num_steps + 1) for j in batch_indices]
        yield torch.tensor(X, dtype=torch.float32, device=device), torch.tensor(Y, dtype=torch.float32, device=device)

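The above is the random sampling implementation, but I think it has two problems. First, because of for i in range(epoch_size), every epoch effectively samples on a grid that starts at index 0: each sample begins at j * num_steps. In the test below, num_examples = (30 - 1) // 6 = 4, so the start positions are 0, 6, 12 and 18, and X can only be drawn from indices 0-23. In other words, no batch of X ever contains 24, 25, 26, 27 or 28.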
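
The coverage claim can be checked without PyTorch. The following is a minimal sketch that reuses only the index arithmetic of the function quoted above; the helper name covered_input_indices is hypothetical and not part of d2lzh_pytorch.

```python
def covered_input_indices(seq_len, num_steps):
    """Return the start positions and the set of indices that can appear in X."""
    # Same arithmetic as the quoted function: the last element is reserved for Y.
    num_examples = (seq_len - 1) // num_steps
    # Every sample starts at j * num_steps, so the grid is anchored at index 0.
    starts = [j * num_steps for j in range(num_examples)]
    covered = {pos + k for pos in starts for k in range(num_steps)}
    return starts, covered


starts, covered = covered_input_indices(seq_len=30, num_steps=6)
print(starts)  # [0, 6, 12, 18]

# Indices 0..28 are valid inputs (each still has a label one step ahead).
never_in_x = sorted(set(range(30 - 1)) - covered)
print(never_in_x)  # [24, 25, 26, 27, 28]
```

Shuffling only reorders these four start positions, so the uncovered tail is the same in every epoch.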

# Test
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

# Output given
X:  tensor([[18., 19., 20., 21., 22., 23.],
        [12., 13., 14., 15., 16., 17.]]) 
Y: tensor([[19., 20., 21., 22., 23., 24.],
        [13., 14., 15., 16., 17., 18.]]) 

X:  tensor([[ 0.,  1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10., 11.]]) 
Y: tensor([[ 1.,  2.,  3.,  4.,  5.,  6.],
        [ 7.,  8.,  9., 10., 11., 12.]])

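Q1: When implementing random sampling, shouldn't at least some epochs produce batches that contain 24, 25, 26, 27, 28? (I may be misunderstanding this.) The same situation arises in adjacent sampling.

Q2: In addition, the implementation above always yields batches with exactly batch_size=2 samples; when samples are left over but there are fewer than batch_size=2 of them, no batch is generated. In fully connected networks and CNNs, however, the last mini-batch we read often holds fewer than batch_size samples. So suppose the test above leaves enough data for a batch with batch_size=1 (for example, with my_seq = list(range(32)), 8 elements are never sampled): should one more batch with batch_size=1 be sampled? Any clarification would be appreciated!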
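
To make Q2 concrete, here is a small sketch of the batch arithmetic for the 32-element case. The helper batch_groups and its drop_last flag are hypothetical names chosen for illustration (the flag name mirrors the drop_last option of torch.utils.data.DataLoader); shuffling is left out so the grouping is easy to read.

```python
def batch_groups(num_examples, batch_size, drop_last):
    """Group example ids 0..num_examples-1 into batches (no shuffling, for clarity)."""
    ids = list(range(num_examples))
    groups = [ids[i:i + batch_size] for i in range(0, num_examples, batch_size)]
    # Optionally discard a final batch that is smaller than batch_size.
    if drop_last and groups and len(groups[-1]) < batch_size:
        groups.pop()
    return groups


seq_len, num_steps, batch_size = 32, 6, 2
num_examples = (seq_len - 1) // num_steps  # 5 full samples fit

# Behaviour of the quoted function: epoch_size = 5 // 2 = 2, the fifth sample is dropped.
dropped = batch_groups(num_examples, batch_size, drop_last=True)
print(dropped)  # [[0, 1], [2, 3]]

# Behaviour asked about in Q2: keep the short final batch.
kept = batch_groups(num_examples, batch_size, drop_last=False)
print(kept)  # [[0, 1], [2, 3], [4]]
```

Whether the short final batch is safe depends on the training loop: if the hidden state is allocated with a fixed batch_size, it would have to be sized from X.shape[0] instead.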

# The Q2 case is as follows:
# Test
my_seq = list(range(32))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

# Output (this includes batches with 0 < batch_size <= 2)
X:  tensor([[18., 19., 20., 21., 22., 23.],
        [ 0.,  1.,  2.,  3.,  4.,  5.]]) 
Y: tensor([[19., 20., 21., 22., 23., 24.],
        [ 1.,  2.,  3.,  4.,  5.,  6.]]) 

X:  tensor([[12., 13., 14., 15., 16., 17.],
        [24., 25., 26., 27., 28., 29.]]) 
Y: tensor([[13., 14., 15., 16., 17., 18.],
        [25., 26., 27., 28., 29., 30.]]) 

X:  tensor([[ 6.,  7.,  8.,  9., 10., 11.]]) 
Y: tensor([[ 7.,  8.,  9., 10., 11., 12.]])

Below is my own version of random sampling, which handles the cases described above.

import numpy as np
import torch

def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because each label index (Y) is the corresponding input index (X) plus 1
    num_examples = (len(corpus_indices) - 1) // num_steps
    # Random start offset for sampling, drawn from the leftover elements
    sample_start = np.random.randint((len(corpus_indices) - 1) % num_steps + 1)
    example_indices = np.arange(sample_start, len(corpus_indices), num_steps)[:num_examples]
    np.random.shuffle(example_indices)

    # Default to the GPU when one is available
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Read up to batch_size random samples each time; the last batch may be smaller
    for idx in np.arange(0, len(example_indices), batch_size):
        batch_example = example_indices[idx:(idx+batch_size)]
        x = [corpus_indices[pos:(pos+num_steps)] for pos in batch_example]
        y = [corpus_indices[(pos+1):(pos+1+num_steps)] for pos in batch_example]
        yield torch.tensor(x, dtype=torch.float32, device=device), torch.tensor(y, dtype=torch.float32, device=device)

Test result

my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

# Output:
X:  tensor([[14., 15., 16., 17., 18., 19.],
        [ 8.,  9., 10., 11., 12., 13.]], device='cuda:0') 
Y: tensor([[15., 16., 17., 18., 19., 20.],
        [ 9., 10., 11., 12., 13., 14.]], device='cuda:0') 

X:  tensor([[ 2.,  3.,  4.,  5.,  6.,  7.],
        [20., 21., 22., 23., 24., 25.]], device='cuda:0') 
Y: tensor([[ 3.,  4.,  5.,  6.,  7.,  8.],
        [21., 22., 23., 24., 25., 26.]], device='cuda:0')
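
A single run like the one above shows only one offset. As a rough check that the random offset reaches every valid input position over many epochs, the sketch below reproduces just the start-position logic of the rewrite using the standard library; random_offset_starts is a hypothetical helper, and the seed and epoch count are arbitrary choices.

```python
import random


def random_offset_starts(seq_len, num_steps, rng):
    """Start positions for one epoch, using a random offset as in the rewrite above."""
    num_examples = (seq_len - 1) // num_steps
    # The offset ranges over the leftover elements that a grid anchored at 0 never reaches.
    offset = rng.randrange((seq_len - 1) % num_steps + 1)
    starts = [offset + j * num_steps for j in range(num_examples)]
    rng.shuffle(starts)
    return starts


rng = random.Random(0)
seq_len, num_steps = 30, 6
seen = set()
for _ in range(200):  # simulate 200 epochs
    for pos in random_offset_starts(seq_len, num_steps, rng):
        # The label slice pos+1 .. pos+num_steps must stay inside the sequence.
        assert pos + num_steps + 1 <= seq_len
        seen.update(range(pos, pos + num_steps))

all_covered = sorted(seen) == list(range(seq_len - 1))
print(all_covered)
```

Each epoch still uses a fixed grid, but the grid shifts between epochs, so positions 24-28 are no longer permanently excluded from X.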

Version info: pytorch: 1.6.0 torchvision: 0.7.0 torchtext: ...

ShusenTang / Dive-into-DL-PyTorch

Question about random sampling and adjacent sampling in Section 6.3.3 #160