IndoNLP / indonlu

The first-ever vast natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained IndoBERT models, and starter code! (AACL-IJCNLP 2020)
https://indobenchmark.com
Apache License 2.0

Adjust DocumentSentimentDataset() to read dataset with 3 columns #9

Closed dhimasyoga16 closed 3 years ago

dhimasyoga16 commented 3 years ago

Hi, many thanks to the IndoBenchmark Team for releasing the IndoBERT models. I'm currently working on my thesis project on sentence similarity detection; the dataset is a set of question pairs scraped from Quora, saved in .csv format with 3 columns: question1, question2, and is_duplicate.

For the training process I'm following Finetuning SMSA.ipynb. But in the Prepare Dataset section, I get an error when running the DocumentSentimentDataset() function, like this: (screenshot: eror1)

It seems to be because the function only accepts 2 columns from the dataset, while my dataset has 3 columns.

And I've tried to change the data_utils.py file like this, but the error remains the same: (screenshot: eror2)

How can I solve this problem? Thank you in advance.

gentaiscool commented 3 years ago

@SamuelCahyawijaya

SamuelCahyawijaya commented 3 years ago

@dhimasyoga16 : So your input is a pair of sentences and the label is binary, right? In that case, you want to extend functionality similar to the EntailmentDataset and EntailmentDataLoader classes instead of DocumentSentimentDataset and DocumentSentimentDataLoader.

If you check the EntailmentDataset class https://github.com/indobenchmark/indonlu/blob/a91815f5c803724d4ed8b536db546967ae660d1c/utils/data_utils.py#L443-L469 you can see that we simply read the csv file, map the labels to indices, and retrieve 3 columns: sent_A, sent_B, and label, which is exactly what you need.

If you are using Finetuning SMSA.ipynb, I think you can simply change the dataset and dataloader in this case and see if it works. I hope it helps and best of luck with your thesis project 😃
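For reference, a minimal sketch of that swap, patterned after the SMSA notebook (the file path, max_seq_len, batch size, and tokenizer checkpoint below are assumptions to adapt to your own setup):

```python
from transformers import BertTokenizer
from utils.data_utils import EntailmentDataset, EntailmentDataLoader

# Assumption: the same tokenizer setup as in the finetuning notebooks.
tokenizer = BertTokenizer.from_pretrained('indobenchmark/indobert-base-p1')

# Swap DocumentSentimentDataset/DataLoader for the entailment equivalents;
# 'train.csv' is a placeholder path to the 3-column question-pair file.
train_dataset = EntailmentDataset('train.csv', tokenizer, lowercase=True)
train_loader = EntailmentDataLoader(dataset=train_dataset, max_seq_len=512,
                                    batch_size=32, num_workers=8, shuffle=True)
```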

dhimasyoga16 commented 3 years ago

Hi, thank you for the quick answer. Unfortunately, it generates this error: (screenshot: eror3_entailment)

Oh, and I forgot to say earlier that my dataset has no header row in it. What's the next step to fix this?

SamuelCahyawijaya commented 3 years ago

@dhimasyoga16 : Yeah, I think that is because there is no header. You can try to:

  1. Load your headerless CSV with pandas
  2. Add the column names
  3. Save the CSV back

You can read more on how to do that with pandas in the official documentation or in one of the many pandas tutorials.

Remember that in EntailmentDataset the column names are sent_A, sent_B, and label. You can either rename your columns to match, or extend the EntailmentDataset class to follow your column definition.
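A minimal pandas sketch of those three steps, assuming the headerless file is named train.csv (adjust the paths to your own):

```python
import pandas as pd

# 1. Load the headerless CSV; header=None keeps the first row as data.
df = pd.read_csv('train.csv', header=None)

# 2. Add the column names that EntailmentDataset expects.
df.columns = ['sent_A', 'sent_B', 'label']

# 3. Save it back, without writing the pandas index as an extra column.
df.to_csv('train_headed.csv', index=False)
```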

btw, I have added an example for the WReTE dataset, you can find it here: https://github.com/indobenchmark/indonlu/blob/master/examples/finetune_wrete.ipynb

dhimasyoga16 commented 3 years ago

I've tried two different sets of steps, but I still get errors:

Step A:

  1. Add a header to the file (question1, question2, is_duplicate) with pandas
  2. Change the INDEX2LABEL and LABEL2INDEX arrays to this: (screenshot: labelling)
  3. Extend the load_dataset() function by changing LABEL2INDEX to INDEX2LABEL (because in my dataset the label values are 0 and 1, not strings), like this: (screenshot: ubah_index2label_isDuplicate)
  4. Extend the EntailmentDataset class to follow my column definition (sent_A to question1, sent_B to question2, label to is_duplicate)
  5. Call the EntailmentDataset class, but the error is still the same (KeyError: label)

Step B:

  1. Add a header to the file (sent_A, sent_B, label) with pandas
  2. Change the INDEX2LABEL and LABEL2INDEX arrays to this: (screenshot: labelling)
  3. Extend the load_dataset() function by changing LABEL2INDEX to INDEX2LABEL (because in my dataset the label values are 0 and 1, not strings), like this: (screenshot: ubah_index2label)
  4. Call the EntailmentDataset class, but the error now looks like this: (screenshot: eror4)

Sorry if I'm asking too many questions; I've done some searching on Stack Overflow but still don't know what's causing the problem.

SamuelCahyawijaya commented 3 years ago

I think it should be LABEL2INDEX, not INDEX2LABEL. As you can see in our load_dataset function, we use LABEL2INDEX to convert from label to index: https://github.com/indobenchmark/indonlu/blob/a91815f5c803724d4ed8b536db546967ae660d1c/utils/data_utils.py#L449-L451
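Illustratively, that logic boils down to the following (the not_duplicate/duplicate mapping here is just this thread's example, not the class default):

```python
import pandas as pd

# Sketch of the linked load_dataset logic: LABEL2INDEX maps each label value
# in the CSV to the integer index the model is trained on.
LABEL2INDEX = {'not_duplicate': 0, 'duplicate': 1}  # example mapping

df = pd.read_csv('train_headed.csv')
df['label'] = df['label'].apply(lambda lab: LABEL2INDEX[lab])  # label -> index
```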

dhimasyoga16 commented 3 years ago

UPDATE

I made changes to the data_utils.py file (changing the variables, etc.) directly in Google Colab. But when I try to print the INDEX2LABEL value, it still hasn't changed. It prints like this instead of printing not_duplicate and duplicate: (screenshot: print_INDEX2LABEL_value)

What can I do so it picks up the changes I've made to the file? Maybe re-declare or re-import the file or something?

SamuelCahyawijaya commented 3 years ago

@dhimasyoga16 : That should be updated if you have already changed the INDEX2LABEL and LABEL2INDEX inside the EntailmentDataset class.

Can you make sure you have the right code on Google Colab? And can you share your notebook as well? It would be easier to discuss if I can check the code directly.

dhimasyoga16 commented 3 years ago

I mean, what I did was:

  1. Open data_utils.py in Colab by double-clicking it
  2. Make the changes as described above and save them
  3. Re-run train_dataset = EntailmentDataset(train_dataset_path, tokenizer, lowercase=True), but the error is still the same

And yeah, sure. You can view and download those files (the Colab notebook, the data_utils.py file I modified, and the dataset) here. The file I changed is at indonlu/utils/data_utils.py. Is that the right file?

SamuelCahyawijaya commented 3 years ago

I think the problem is in the import part, or the indonlu/utils/data_utils.py is somehow not updated correctly in your case. I tried loading your code locally with the updated data_utils.py file, and I get the following result:

(screenshot: Screenshot 2020-10-15 at 2.49.41 PM)

I don't know how you modified the file, but I would recommend that you fork the repo, make your changes in the fork, and clone the forked repo from Google Colab.
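A sketch of that workflow in a Colab cell, where <your-username> is a placeholder; importlib.reload is an alternative if the module was already imported before you edited it:

```python
# In a Colab cell: clone your fork instead of the upstream repo.
# <your-username> is a placeholder for your GitHub account name.
!git clone https://github.com/<your-username>/indonlu.git

# Alternative: Python caches imported modules, so edits to data_utils.py made
# after the import are invisible until you reload it (or restart the runtime).
import importlib
import utils.data_utils
importlib.reload(utils.data_utils)
from utils.data_utils import EntailmentDataset  # re-bind the updated class
```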

dhimasyoga16 commented 3 years ago

I have forked the repo, made the change to the file, then pulled it in the correct way. It can now print EntailmentDataset.INDEX2LABEL correctly. I'm using the dataset with headers adjusted to the EntailmentDataset class (sent_A, sent_B, and label), but the problem still exists with the same error 😕 (screenshot: eror5)

Oh, and were you able to call the EntailmentDataset() class without facing any error?

SamuelCahyawijaya commented 3 years ago

@dhimasyoga16 : All of our example codes work just fine; we have tested them on different servers. You can try running them on your end without modification.

It seems like the problem is that your train_headed.csv file has labels with values 0 and 1, but your LABEL2INDEX consists of not_duplicate and duplicate. In this case you have to change label 0 into not_duplicate and label 1 into duplicate, or you can simply change your LABEL2INDEX into {0: 0, 1: 1}.
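A sketch of the second option, written as a subclass so data_utils.py itself stays untouched (the subclass name is made up for illustration; the class attributes mirror those on EntailmentDataset):

```python
from utils.data_utils import EntailmentDataset

class QuoraDuplicateDataset(EntailmentDataset):
    # Labels in train_headed.csv are already 0/1, so the mapping is the identity.
    LABEL2INDEX = {0: 0, 1: 1}
    INDEX2LABEL = {0: 'not_duplicate', 1: 'duplicate'}
    NUM_LABELS = 2

# tokenizer as defined earlier in the notebook
train_dataset = QuoraDuplicateDataset('train_headed.csv', tokenizer, lowercase=True)
```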

dhimasyoga16 commented 3 years ago

Bravo! Changing the LABEL2INDEX into {0: 0, 1: 1} works like a charm and fixed my problem above. Thank you very much, your answers really helped me 😃

And I have two more questions to ask:

  1. What is the dev set used for? I mean, yeah, it's for evaluation, but can you kindly explain more about it?
  2. In the example notebook, it seems the dataset didn't go through a preprocessing stage (stemming and stopword removal). Is there a reason for that? Or was preprocessing done beforehand?

SamuelCahyawijaya commented 3 years ago

  1. This is one of the basic concepts of machine learning: usually we have a training set, a dev/validation set, and a test set (a quick split sketch follows this list).

    • The training set is used for training the model
    • The dev/validation set is data the model does not learn from; it is used for choosing the best model
    • The test set is data the model has never seen before; it is used for calculating the final metric
  2. This one is rather complicated to explain in detail, but in short I can answer with "we don't have to, because we use subword tokenization and we don't want the model to learn/predict from meaningless sentences".
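As a concrete example of point 1, a quick sketch of making such a split with scikit-learn (the 80/10/10 ratio and the file name are just examples):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('train_headed.csv')

# Example 80/10/10 split: carve out 10% for test, then 1/9 of the remaining
# 90% (= 10% overall) for dev; fix random_state for reproducibility.
train_dev, test = train_test_split(df, test_size=0.10, random_state=42)
train, dev = train_test_split(train_dev, test_size=1/9, random_state=42)
```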

You can read more about subword tokenization in articles such as https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46 and https://blog.floydhub.com/tokenization-nlp/.
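To see subword tokenization in action, a small sketch with the Hugging Face tokenizer (assuming the indobenchmark/indobert-base-p1 checkpoint; the exact pieces depend on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('indobenchmark/indobert-base-p1')

# A long inflected word is broken into known subword pieces, so the model can
# handle it without stemming or a fixed word-level vocabulary.
print(tokenizer.tokenize('mempertanggungjawabkan'))
# e.g. ['memper', '##tanggung', '##jawab', '##kan'] (actual pieces vary by vocab)
```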

For a better understanding of the BERT model (how it works, how it is pretrained, etc.), you can read the paper (https://arxiv.org/pdf/1810.04805.pdf). If you are not familiar with the transformer model, kindly check https://arxiv.org/pdf/1706.03762.pdf.

I wish you success with your thesis, and in case you want to continue studying this field in the future, I highly suggest that you actively read research papers, do more research, and publish in prestigious international conferences.