Closed · dhimasyoga16 closed this issue 3 years ago
@SamuelCahyawijaya
@dhimasyoga16 : So your input is a pair of sentences and a binary label, right? In that case, you want to extend functionality similar to the `EntailmentDataset` and `EntailmentDataLoader` classes instead of `DocumentSentimentDataset` and `DocumentSentimentDataLoader`.
If you check the `EntailmentDataset` class https://github.com/indobenchmark/indonlu/blob/a91815f5c803724d4ed8b536db546967ae660d1c/utils/data_utils.py#L443-L469 you can see that we simply read the CSV file, map the label into an index, and retrieve 3 columns: `sent_A`, `sent_B`, and `label`, which is exactly what you need.
If you are using `Finetuning SMSA.ipynb`, I think you can simply change the dataset and dataloader in this case and see if it works. I hope it helps, and best of luck with your thesis project 😃
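The reading logic described above can be sketched roughly as follows (a stdlib-only illustration of what `EntailmentDataset`'s loading does; the label names here are just examples, the real mapping lives in `data_utils.py`):

```python
import csv

# Example label mapping in the spirit of EntailmentDataset.LABEL2INDEX
LABEL2INDEX = {'NotEntail': 0, 'Entail_or_Paraphrase': 1}

def load_entailment_csv(path):
    """Read a CSV with sent_A, sent_B, label columns; map labels to indices."""
    rows = []
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            rows.append((row['sent_A'], row['sent_B'], LABEL2INDEX[row['label']]))
    return rows
```

The real class also tokenizes the two sentences; the point here is only the three-column read plus label-to-index mapping.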
Hi, thank you for the quick answer. Unfortunately, it generates this error:

Oh, and I forgot to mention earlier that my dataset has no header row. What's the next step to fix this?
@dhimasyoga16 : Yeah, I think that is because there is no header. You can try to add a header row to your CSV file, for example with `pandas`; you can read more about it in the official `pandas` documentation or many other `pandas` tutorials.
Remember that in `EntailmentDataset` the column names are `sent_A`, `sent_B`, and `label`. You can either modify your columns to follow that convention, or extend the `EntailmentDataset` class to follow your column definition.
Btw, I have added an example for the WReTE dataset, you can find the example here: https://github.com/indobenchmark/indonlu/blob/master/examples/finetune_wrete.ipynb
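A sketch of adding a header to a header-less CSV (stdlib only; with `pandas` you could equivalently use `read_csv(..., header=None, names=[...])` followed by `to_csv`; the file names here are hypothetical):

```python
import csv

def add_header(src, dst, header=('sent_A', 'sent_B', 'label')):
    """Copy a header-less CSV to dst, prepending the expected column names."""
    with open(src, newline='', encoding='utf-8') as fin, \
         open(dst, 'w', newline='', encoding='utf-8') as fout:
        writer = csv.writer(fout)
        writer.writerow(header)            # the names EntailmentDataset expects
        writer.writerows(csv.reader(fin))  # copy every data row unchanged
```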
I've tried 2 different approaches but I still get an error:

Step A:

1. Changed the `INDEX2LABEL` and `LABEL2INDEX` arrays to this:
2. Modified the `load_dataset()` function by changing the `LABEL2INDEX` to `INDEX2LABEL` (because in my dataset the label values are 0 and 1, not strings), like this:
3. Extended the `EntailmentDataset` class to follow my column definition (`sent_A` to `question1`, `sent_B` to `question2`, `label` to `is_duplicate`)
4. Called the `EntailmentDataset` class, but the error is still the same (`KeyError: label`)

Step B:

1. Changed the `INDEX2LABEL` and `LABEL2INDEX` arrays to this:
2. Modified the `load_dataset()` function by changing the `LABEL2INDEX` to `INDEX2LABEL` (because in my dataset the label values are 0 and 1, not strings), like this:
3. Called the `EntailmentDataset` class, but the error now looks like this:

Sorry if I'm asking too many questions; I've searched on Stack Overflow but still don't know what's causing the problem.
I think it should be `LABEL2INDEX`, not `INDEX2LABEL`. You can see in our `load_dataset` function that we use `LABEL2INDEX` to convert from label to index:
https://github.com/indobenchmark/indonlu/blob/a91815f5c803724d4ed8b536db546967ae660d1c/utils/data_utils.py#L449-L451
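What those lines do, in isolation (a stdlib sketch; in `data_utils.py` the mapping is applied to a `pandas` column, and the label names here mirror the thread):

```python
# Mirrors the idea of df['label'] = df['label'].apply(lambda lab: LABEL2INDEX[lab])
LABEL2INDEX = {'not_duplicate': 0, 'duplicate': 1}

def encode_labels(raw_labels):
    """Convert each raw label value to its integer class index."""
    return [LABEL2INDEX[lab] for lab in raw_labels]
```

Any label value missing from `LABEL2INDEX` raises `KeyError`, which is why a mismatch between the mapping's keys and the values actually stored in the CSV fails.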
UPDATE
I made changes to `data_utils.py` (changing the variables, etc.) directly in Google Colab. But when I try to print the `INDEX2LABEL` value, it's still not changing. It prints like this instead of printing `not_duplicate` and `duplicate`:

What can I do so it picks up the changes I've made to the file? Maybe re-declare or re-import the file or something?
@dhimasyoga16 : That should be updated if you have already changed the `INDEX2LABEL` and `LABEL2INDEX` inside the `EntailmentDataset` class.
Can you ensure that you have the right code on Google Colab? Can you share your notebook as well? It would be easier to discuss if I can check the code directly.
I mean, what I do is:

1. Edit `data_utils.py` in Colab by double-clicking it
2. Re-run `train_dataset = EntailmentDataset(train_dataset_path, tokenizer, lowercase=True)`, but the error is still the same

And yeah, sure. You can view and download those files (the Colab notebook, the `data_utils.py` file which I've made changes to, and the dataset file) here
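One likely cause (my assumption, not confirmed in the thread): Python caches imported modules, so editing `data_utils.py` on disk and re-running the constructor keeps using the old, already-imported copy. A self-contained sketch of forcing a re-read with `importlib.reload` (the `demo_utils` module name is made up):

```python
import importlib
import pathlib
import sys

# Throwaway module standing in for data_utils.py
mod_path = pathlib.Path('demo_utils.py')
mod_path.write_text("INDEX2LABEL = {0: 'NotEntail', 1: 'Entail_or_Paraphrase'}\n")

sys.path.insert(0, str(mod_path.parent.resolve()))
import demo_utils                # the first import is cached in sys.modules

# Simulate editing the file on disk, as in the Colab file editor
mod_path.write_text("INDEX2LABEL = {0: 'not_duplicate', 1: 'duplicate'}\n")

importlib.reload(demo_utils)     # without this, the cached module is reused
print(demo_utils.INDEX2LABEL)
```

In the notebook the equivalent would be reloading the imported `data_utils` module after each edit, or simply restarting the Colab runtime.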
I made the changes to `data_utils.py`, whose path is `indonlu/utils/data_utils.py`. Is that the right file containing the code?
I think the problem is the import part, or somehow `indonlu/utils/data_utils.py` is not updated correctly in your case. I tried loading your code locally with the updated `data_utils.py` file and I get the following result:

I don't know how you modified the file, but I would recommend that you fork the repo, make the changes in your fork, and pull the forked repo from Google Colab.
I have forked the repo, made the change to the file, then pulled it the correct way.
It can now print `EntailmentDataset.INDEX2LABEL` correctly.
I'm using the dataset whose headers are adjusted to the `EntailmentDataset` class (i.e. the headers are `sent_A`, `sent_B`, and `label`), but yeah, the problem still exists with the same error code 😕
Oh, and did you call the `EntailmentDataset()` class as well without facing any error?
@dhimasyoga16 : All of our example codes work just fine; we have tested them on different servers. You can try to run them on your end without modification.
It seems the problem is that your train_headed.csv file has labels with values `0` and `1`, but your `LABEL2INDEX` consists of `not_duplicate` and `duplicate`. In this case you have to change the label `0` into `not_duplicate` and the label `1` into `duplicate`, or you can simply change your `LABEL2INDEX` into `{0: 0, 1: 1}`.
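A minimal reconstruction of the mismatch and the suggested fix (the mapping values come from the thread; the sample labels are illustrative):

```python
# Labels as they actually appear in the CSV
raw_labels = [0, 1, 1, 0]

# Mismatched mapping: the keys are strings, so integer labels are never found
string_keyed = {'not_duplicate': 0, 'duplicate': 1}
try:
    [string_keyed[lab] for lab in raw_labels]
except KeyError as e:
    print('KeyError on label', e)   # the failure seen in the thread

# Suggested fix: an identity mapping over the integer labels
LABEL2INDEX = {0: 0, 1: 1}
print([LABEL2INDEX[lab] for lab in raw_labels])
```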
Bravo!
Changing the `LABEL2INDEX` into `{0: 0, 1: 1}` worked like a charm and fixed my problem above.
Thank you very much, your answers really helped me 😃
And I have two more questions to ask:
1. This one is one of the basic concepts of machine learning: usually we have training, dev/validation, and test sets.
2. This one is rather complicated to explain in detail, but in short I can answer with "because we don't have to, since we have subword tokenization, and we don't want a model to learn/predict from meaningless sentences". You can read more in subword tokenization articles such as https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46 and https://blog.floydhub.com/tokenization-nlp/.
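To illustrate what subword tokenization does, here is a toy greedy longest-match-first split in the spirit of WordPiece (the vocabulary is made up; real tokenizers learn theirs from a corpus):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.

    Unknown words are broken into known pieces; continuation pieces
    carry a '##' prefix, so no word is ever truly out-of-vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        prefix = '##' if start > 0 else ''
        # shrink the candidate piece until it is found in the vocabulary
        while end > start and prefix + word[start:end] not in vocab:
            end -= 1
        if end == start:            # no piece matched at all
            return ['[UNK]']
        pieces.append(prefix + word[start:end])
        start = end
    return pieces

# Toy vocabulary (purely illustrative)
vocab = {'un', '##aff', '##able', 'aff', '##le'}
print(wordpiece_tokenize('unaffable', vocab))  # ['un', '##aff', '##able']
```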
For a better understanding of the BERT model (how it works, how it is pretrained, etc.), you can read the paper (https://arxiv.org/pdf/1810.04805.pdf). If you are not familiar with the transformer model, kindly check (https://arxiv.org/pdf/1706.03762.pdf).
I wish you success in your thesis, and in case you want to continue your studies in this field in the future, I highly suggest you actively read research papers, do more research, and publish papers in prestigious international conferences.
Hi, many thanks to the IndoBenchmark team for releasing the IndoBERT model. I'm currently working on my thesis project about sentence similarity detection; the dataset is pairs of questions scraped from Quora, saved in .csv format with 3 columns: question1, question2, and is_duplicate.
For the training process I'm following `Finetuning SMSA.ipynb`. But in the Prepare Dataset section, I get an error when running the `DocumentSentimentDataset()` function, like this:

It seems to be because the function only "accepts" 2 columns from the dataset, while my dataset has 3 columns.
I've also tried to change the `data_utils.py` file like this, but the error remains the same:

How can I solve this problem? Thank you in advance.