ChunyuanLI / Optimus

Optimus: the first large-scale pre-trained VAE language model

Suggestion for some added functions #4

Closed summerstay closed 4 years ago

summerstay commented 4 years ago

Your program works very well! I rewrote the interpolation function to make it easier for me to use in different ways. Perhaps others would also find this useful.

import torch

def latent_code_from_text(text, encoder_tokenizer, model_vae, args):
    # Tokenize and wrap the sentence with BERT's [CLS] (101) and [SEP] (102) ids
    tokenized1 = encoder_tokenizer.encode(text)
    tokenized1 = [101] + tokenized1 + [102]
    coded1 = torch.tensor([tokenized1], dtype=torch.long)
    with torch.no_grad():
        x0 = coded1.to(args.device)
        # Pooled hidden feature from the BERT encoder
        pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
        # Project to mean and log-variance; use the mean as the latent code
        mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
        latent_z = mean.squeeze(1)
        coded_length = len(tokenized1)
        return latent_z, coded_length

def text_from_latent_code(latent_z, model_vae, sentence_length, args, decoder_tokenizer):
    past = latent_z
    context_tokens = decoder_tokenizer.encode('<BOS>')
    length = torch.tensor([[sentence_length]], dtype=torch.long)
    # Decode the latent code with the GPT-2 decoder
    out = sample_sequence_conditional(
        model=model_vae.decoder,
        context=context_tokens,
        past=past,
        length=length,  # Chunyuan: Fix length; or use <EOS> to complete a sentence
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
        device=args.device,
        decoder_tokenizer=decoder_tokenizer
    )
    # Detokenize and strip the <BOS>/<EOS> tokens
    text_x1 = decoder_tokenizer.decode(out[0, :].tolist(), clean_up_tokenization_spaces=True)
    text_x1 = text_x1.split()[1:-1]
    text_x1 = ' '.join(text_x1)
    return text_x1

...

# and then in the main function         
latent_z1, coded_length1 = latent_code_from_text("a brown dog likes to eat his food very slowly .", tokenizer_encoder, model_vae, args)
latent_z2, coded_length2 = latent_code_from_text("a yellow cat likes to chase a long string .", tokenizer_encoder, model_vae, args)

result = text_from_latent_code((latent_z1 + latent_z2) / 2, model_vae, coded_length1, args, tokenizer_decoder)
print(result)
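
The same two helpers can also sweep the latent space between the two sentences. Below is a minimal sketch; the step count and the reuse of coded_length1 for every decode are illustrative choices, not part of the functions above.

num_steps = 10
for i in range(num_steps + 1):
    t = i / num_steps
    # Linear interpolation between the two latent codes
    latent_z = (1.0 - t) * latent_z1 + t * latent_z2
    sentence = text_from_latent_code(latent_z, model_vae, coded_length1, args, tokenizer_decoder)
    print('%.1f  %s' % (t, sentence))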
summerstay commented 4 years ago

For example, this small change:

latent_z1, coded_length1 = latent_code_from_text("a brown dog likes to eat his food very slowly .", tokenizer_encoder, model_vae, args)
latent_z2, coded_length2 = latent_code_from_text("a yellow cat likes to chase a long string .", tokenizer_encoder, model_vae, args)
latent_z3, coded_length3 = latent_code_from_text("a yellow cat likes to chase a short string .", tokenizer_encoder, model_vae, args)

result = text_from_latent_code(latent_z2 - latent_z3 + latent_z1, model_vae, coded_length1, args, tokenizer_decoder)

results in the sentence "a brown dog likes to eat his whole food so fast.", so it is forming sentence analogies.
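The same arithmetic can be wrapped in a small helper; the name analogy_from_texts and the choice to reuse the first sentence's coded length are illustrative, not part of the repository code.

def analogy_from_texts(text_a, text_b, text_c, tokenizer_encoder, tokenizer_decoder, model_vae, args):
    # Applies the difference between text_b and text_c to text_a in latent space: z_b - z_c + z_a
    latent_a, length_a = latent_code_from_text(text_a, tokenizer_encoder, model_vae, args)
    latent_b, _ = latent_code_from_text(text_b, tokenizer_encoder, model_vae, args)
    latent_c, _ = latent_code_from_text(text_c, tokenizer_encoder, model_vae, args)
    return text_from_latent_code(latent_b - latent_c + latent_a, model_vae, length_a, args, tokenizer_decoder)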

ChunyuanLI commented 4 years ago

Brilliant work. I updated the code to incorporate the functions (edited a bit); see here.

ChunyuanLI commented 4 years ago

Releasing a demo for latent space manipulation, including sentence interpolation and analogy. I hope it makes it easier to interact with the model.