IndoNLP / indonlu

The first-ever vast natural language processing benchmark for Indonesian Language. We provide multiple downstream tasks, pre-trained IndoBERT models, and a starter code! (AACL-IJCNLP 2020)
https://indobenchmark.com
Apache License 2.0

ValueError: invalid literal for int() with base 10: 'sentiment' #43

Closed widyaputeriaulia10 closed 1 year ago

widyaputeriaulia10 commented 1 year ago

Expected Behavior

Dear Author,

I want to do multiclass classification by modifying DocumentSentimentDataset:

class DocumentSentimentDataset(Dataset):
    # Static constant variable
    LABEL2INDEX = {'Ekonomi': 0, 'Hukum': 1, 'Kesehatan': 2, 'Sosial': 3, 'Teknologi': 4}
    INDEX2LABEL = {0: 'Ekonomi', 1: 'Hukum', 2: 'Kesehatan', 3: 'Sosial', 4: 'Teknologi'}
    NUM_LABELS = 5

    def load_dataset(self, path):
        df = pd.read_csv(path, sep='\t', header=None)
        df.columns = ['text', 'sentiment']
        #df['sentiment'] = df['sentiment'].apply(lambda lab: self.LABEL2INDEX[lab])
        return df

    def __init__(self, dataset_path, tokenizer, no_special_token=False, *args, **kwargs):
        self.data = self.load_dataset(dataset_path)
        self.tokenizer = tokenizer
        self.no_special_token = no_special_token

    def __getitem__(self, index):
        data = self.data.loc[index, :]
        text, sentiment = data['text'], data['sentiment']
        subwords = self.tokenizer.encode(text, add_special_tokens=not self.no_special_token)
        return np.array(subwords), np.array(sentiment), data['text']

    def __len__(self):
        return len(self.data)

But when I started training the model, I got this error:

ValueError: Caught ValueError in DataLoader worker process 12.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/kaggle/working/indonlu/utils/data_utils.py", line 550, in _collate_fn
    sentiment_batch[i,0] = sentiment
ValueError: invalid literal for int() with base 10: 'sentiment'

I have checked that the 'sentiment' column is int.

Do you have any advice for my problem?

Thank you in advance.

SamuelCahyawijaya commented 1 year ago

Hi @widyaputeriaulia10, thank you for using IndoNLU.

I can't comment much on the problem, since neither the data nor the exact loading script is provided.

One possible cause of this error is that your CSV file has a header row. In our case, we use df = pd.read_csv(path, sep='\t', header=None) since our data has no header row.

If your CSV contains a header row, you can omit the header=None when loading the dataset.
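As a minimal sketch of how a header row could lead to this error: the TSV content below is made up for illustration, since the actual file was not shared. With header=None the first line is read as data, so the string 'sentiment' ends up in the label column and int() fails on it.

```python
import io

import pandas as pd

# Hypothetical two-row TSV with a header line (an assumption, not the real data).
tsv_text = "text\tsentiment\nberita ekonomi\t0\nberita hukum\t1\n"

# header=None treats the first line as a data row, so the column names
# become the first record and the whole label column stays as strings.
df_no_header = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=None)
df_no_header.columns = ["text", "sentiment"]
labels_no_header = df_no_header["sentiment"].tolist()
print(labels_no_header)

# Without header=None, pandas uses the first line as column names
# and parses the labels as integers.
df_with_header = pd.read_csv(io.StringIO(tsv_text), sep="\t")
labels_with_header = df_with_header["sentiment"].tolist()
print(labels_with_header)

# Converting the stray header cell reproduces the message in the issue title.
try:
    int(labels_no_header[0])
except ValueError as err:
    print(err)
```

Printing the first few rows of the loaded frame is a quick way to see whether the column names were read as data.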

Let me know if the problem persists, and please also send the code snippet, error message, and view of the data, so that it will be easier for us to trace the problem.

Thank you and hope it helps!

widyaputeriaulia10 commented 1 year ago

Dear author, thank you for your response. I followed your instructions, but another error occurred. I have attached the code and the error message.

n_epochs = 4
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)

    total_train_loss = 0
    list_hyp_train, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        # Forward model
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss = total_train_loss + tr_loss

        # Compute train metrics
        list_hyp_train += batch_hyp
        list_label += batch_label

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1), total_train_loss/(i+1), get_lr(optimizer)))

    metrics = document_sentiment_metrics_fn(list_hyp_train, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format((epoch+1),
        total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)))

    # Evaluate on the validation set
    model.eval()
    torch.set_grad_enabled(False)

    total_loss, total_correct, total_labels = 0, 0, 0
    list_hyp, list_label = [], []

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        batch_seq = batch_data[-1]
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Compute total loss
        valid_loss = loss.item()
        total_loss = total_loss + valid_loss

        # Compute evaluation metrics
        list_hyp += batch_hyp
        list_label += batch_label
        metrics = document_sentiment_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))

    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1),
        total_loss/(i+1), metrics_to_string(metrics)))

This is the error:

0%| | 0/16 [00:01<?, ?it/s]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_23/3084095572.py in
     10     for i, batch_data in enumerate(train_pbar):
     11         # Forward model
---> 12         loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')
     13
     14         # Update model

/kaggle/working/indonlu/utils/forward_fn.py in forward_sequence_classification(model, batch_data, i2w, is_test, device, **kwargs)
     21
     22     if device == "cuda":
---> 23         subword_batch = subword_batch.cuda()
     24         mask_batch = mask_batch.cuda()
     25         token_type_batch = token_type_batch.cuda() if token_type_batch is not None else None

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

And this is the data that I saved in .tsv format: [image: image.png]

I think the error occurred because of the accelerator I used (a Kaggle P100 GPU), but if you have another opinion or insight, it would be helpful. Thank you!
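A device-side assert in a classification setup is often triggered by an out-of-range index, for example a label id outside [0, NUM_LABELS) or a label that is still a string. Nothing in this thread confirms that this is the cause here. As a minimal sketch, a check like the following could be run on the CPU before training. `find_invalid_labels` is a hypothetical helper, not part of IndoNLU, and the sample labels are made up.

```python
from numbers import Integral

NUM_LABELS = 5  # same value as NUM_LABELS in the dataset class posted in this issue


def find_invalid_labels(labels, num_labels):
    """Return (position, value) pairs for labels that are not integer ids in [0, num_labels)."""
    invalid = []
    for position, value in enumerate(labels):
        # bool is a subclass of int, so exclude it explicitly
        is_int = isinstance(value, Integral) and not isinstance(value, bool)
        if not is_int or not 0 <= value < num_labels:
            invalid.append((position, value))
    return invalid


# Made-up labels: 5 is out of range, and 'Hukum' was never mapped to an id.
sample_labels = [0, 4, 5, "Hukum", 2]
problems = find_invalid_labels(sample_labels, NUM_LABELS)
print(problems)
```

With the dataset class from this issue, the same check could be applied to `self.data['sentiment'].tolist()`. An empty result would rule out labels as the source of the assert, though not other indices such as token ids.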

widyaputeriaulia10 commented 1 year ago

Hi Samuel, I trained the model on the CPU and it works. Thank you!