bentrevett / pytorch-seq2seq

Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
MIT License

torchtext recent version (0.12.0) doesn't support Field, BucketIterator #185

Closed: manik2304 closed this issue 8 months ago

manik2304 commented 2 years ago

The recent version of torchtext (0.12.0) doesn't support Field, BucketIterator, etc. What are the equivalent modules for pre-processing datasets like Multi30k, IWSLT2016, IWSLT2017, etc.? Thanks.

johnnyhwu commented 2 years ago

Using torchtext version 0.11 solved the problem for me:

conda install pytorch torchtext=0.11 cudatoolkit=11.3 -c pytorch

Jiazxu commented 1 year ago

torchtext >= 0.12 removed Field and the legacy modules. You can try this:

from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

from collections import Counter
from torchtext.datasets import Multi30k
from torchtext.vocab import vocab
from torchtext.data import get_tokenizer
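
The imports above point at the usual replacement pattern: count tokens with Counter, build a vocabulary, and pad each batch inside a collate function passed to DataLoader. Below is a minimal sketch of that pattern, not the exact code from this comment. To keep it runnable without downloading Multi30k, it assumes a hypothetical in-memory list of sentence pairs, uses a plain dict where torchtext.vocab.vocab would normally go, and uses str.split where get_tokenizer would normally go. The names build_vocab, encode, and collate_batch are made up for illustration.

```python
from collections import Counter

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy parallel corpus standing in for Multi30k (de, en) pairs.
pairs = [
    ("ein hund läuft", "a dog runs"),
    ("eine katze schläft", "a cat sleeps"),
    ("ein hund schläft im haus", "a dog sleeps in the house"),
]

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]


def build_vocab(sentences, min_freq=1):
    """Map each token to an integer id; special tokens get the first ids."""
    counter = Counter(tok for s in sentences for tok in s.split())
    itos = SPECIALS + sorted(t for t, c in counter.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


src_vocab = build_vocab(src for src, _ in pairs)
trg_vocab = build_vocab(trg for _, trg in pairs)


def encode(sentence, vocab):
    """Whitespace-tokenize, add <bos>/<eos>, and convert to a LongTensor."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]
    return torch.tensor([vocab["<bos>"]] + ids + [vocab["<eos>"]], dtype=torch.long)


def collate_batch(batch):
    """Numericalize a list of (src, trg) strings and pad to the longest sequence."""
    src = [encode(s, src_vocab) for s, _ in batch]
    trg = [encode(t, trg_vocab) for _, t in batch]
    # pad_sequence returns tensors of shape [max_len, batch_size] by default.
    src = pad_sequence(src, padding_value=src_vocab["<pad>"])
    trg = pad_sequence(trg, padding_value=trg_vocab["<pad>"])
    return src, trg


loader = DataLoader(pairs, batch_size=2, shuffle=False, collate_fn=collate_batch)
src_batch, trg_batch = next(iter(loader))
print(src_batch.shape, trg_batch.shape)
```

The [max_len, batch_size] layout matches what the tutorials' models expect from the old BucketIterator batches. Note that this sketch only pads; it does not group sentences of similar length into the same batch the way BucketIterator did.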

saqib-sarwar commented 1 year ago

@Jiazxu What should I do with a custom dataset stored as a .csv file? How do I load it and then perform a train/validation split?

Jiazxu commented 1 year ago

> @Jiazxu What should I do with a custom dataset stored as a .csv file? How do I load it and then perform a train/validation split?

This can be done with the pandas library. First, wrap the .csv file in a torch.utils.data.Dataset subclass. The code looks something like this (details depend on your data):

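import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

class DataSet_csv(Dataset):
    def __init__(self, data_dir):
        # Assumes every column in the .csv file is numeric.
        frame = pd.read_csv(data_dir)
        self.data = torch.tensor(frame.values, dtype=torch.float32)
        # Placeholder: the label is just a copy of the data.
        # Replace this with your own label column.
        self.label = self.data.clone()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return one (data, label) pair per row.
        return self.data[idx], self.label[idx]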

Then you can pass a DataSet_csv instance to a DataLoader.
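
For the train/validation split, one option is torch.utils.data.random_split, which works on any map-style Dataset. Here is a minimal sketch, assuming an 80/20 split and using a small TensorDataset as a hypothetical stand-in for the dataset loaded from your .csv file:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in for a dataset loaded from a .csv file:
# 10 rows, 3 feature columns, 1 integer label per row.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
full_dataset = TensorDataset(features, labels)

# 80/20 split; the seeded generator makes the split reproducible.
n_val = len(full_dataset) // 5
n_train = len(full_dataset) - n_val
train_set, val_set = random_split(
    full_dataset,
    [n_train, n_val],
    generator=torch.Generator().manual_seed(0),
)

# Shuffle only the training data; keep validation order fixed.
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
val_loader = DataLoader(val_set, batch_size=4, shuffle=False)

print(len(train_set), len(val_set))
```

random_split returns Subset objects that index into the original dataset, so no data is copied. If your rows are ordered in time or grouped by class, a random split may not be what you want.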