bksaini078 / MT_Bert_FakeNews

Mean Teacher BERT discovers Fake News

unlabeled samples not processing #5

Open · isspek opened this issue 3 years ago

isspek commented 3 years ago

In the following method, we are not processing the unlabeled samples, and I couldn't work out why. Could you explain it? Thanks. https://github.com/bksaini078/MT_Bert_FakeNews/blob/3b619634827f93262bc92202d9c15b4439afd8b3/Mean_Teacher/data_loader.py#L29

bksaini078 commented 3 years ago

I was using the unlabeled data in data slices before. I have a backup of that code; I removed it because when the unlabeled set is small, the model fails for lack of unlabeled data. Instead, I am now randomly selecting unlabeled samples. It can be reverted to the older implementation. [screenshot]
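A minimal sketch of what that random selection could look like; the helper name, signature, and sampling-with-replacement choice are assumptions for illustration, not the repository's actual code:

```python
import numpy as np

def sample_unlabeled(x_unlabeled, n_samples, rng=None):
    """Randomly draw unlabeled examples for a training step.

    Hypothetical helper: sampling with replacement keeps this working
    even when the unlabeled pool is smaller than n_samples, which is
    the failure case described above.
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(x_unlabeled), size=n_samples, replace=True)
    return x_unlabeled[idx]
```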

isspek commented 3 years ago

I see. I am changing the code; after finishing it, I will show you. Maybe we won't need this method at all. For the unlabeled samples, I am using the Grover fake news generator, which yields many samples.

bksaini078 commented 3 years ago

Yes, sure, thank you.

isspek commented 3 years ago

I couldn't get this part: are you trying to add unlabeled samples to the labeled samples? Because we assume that the unlabeled samples are the noisy samples. https://github.com/bksaini078/MT_Bert_FakeNews/blob/3b619634827f93262bc92202d9c15b4439afd8b3/Mean_Teacher/costfunction.py#L23
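For context, the usual Mean Teacher objective (Tarvainen & Valpola, 2017) does touch unlabeled samples, but only through the consistency term, never the classification term. A hedged sketch, with every name assumed rather than taken from costfunction.py:

```python
import tensorflow as tf

def mean_teacher_loss(y_true, student_logits, teacher_logits,
                      labeled_mask, consistency_weight=1.0):
    """Supervised cross-entropy on labeled samples only, plus a
    consistency (MSE) term between student and teacher predictions
    on every sample, labeled or not. Illustrative names throughout."""
    y_safe = tf.maximum(y_true, 0)  # keep any sentinel labels in range
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        y_safe, student_logits, from_logits=True)
    # labeled_mask is 1.0 for labeled rows, 0.0 for unlabeled (noisy)
    # rows, so unlabeled samples never enter the classification term.
    class_loss = tf.reduce_sum(ce * labeled_mask) / (
        tf.reduce_sum(labeled_mask) + 1e-8)
    consistency = tf.reduce_mean(tf.square(
        tf.nn.softmax(student_logits) - tf.nn.softmax(teacher_logits)))
    return class_loss + consistency_weight * consistency
```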

bksaini078 commented 3 years ago

This is the same thing I discussed with you in our last meeting. [screenshot]

isspek commented 3 years ago

Sorry, I couldn't get this part. Anyway, after the code change I will mark it as a todo so you can check that part. Currently I am concatenating the labeled and unlabeled data at the beginning and shuffling.

isspek commented 3 years ago

Okay, I changed the strategy there (see the new implementation). Currently, if we use a batch size of 1 and add unlabeled samples, the model can replace the labeled data with unlabeled data, which could leave the model not learning. To prevent this, I removed the union shuffle and take the size of the unlabeled samples as the reference. This way we can add more data from the unlabeled samples than the batch size while ensuring that the number of labeled samples stays equal to the batch size. Let me know if this strategy does not make sense.
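If I read the description right, each batch would then be built roughly as in the sketch below; the function name, the n_unlabeled knob, and the -1 sentinel for unlabeled targets are all assumptions for illustration:

```python
import numpy as np

def build_batch(x_labeled, y_labeled, x_unlabeled,
                batch_size, n_unlabeled, rng=None):
    """Keep exactly `batch_size` labeled samples per batch and append
    unlabeled ones, instead of shuffling the union (which can crowd
    labeled data out of a small batch)."""
    rng = rng or np.random.default_rng()
    lab = rng.choice(len(x_labeled), size=batch_size, replace=False)
    unl = rng.choice(len(x_unlabeled), size=n_unlabeled, replace=True)
    x = np.concatenate([x_labeled[lab], x_unlabeled[unl]])
    # -1 marks unlabeled samples so the loss can mask them out.
    y = np.concatenate([y_labeled[lab], np.full(n_unlabeled, -1)])
    return x, y
```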

bksaini078 commented 3 years ago

If I understand correctly, after augmentation the training batch will contain more than 1 sample (if the batch size is 1): 1 labeled sample plus the unlabeled samples. Yes, the strategy looks fine to me. One point I would like to mention: with a larger batch size this will also grow the effective training batch, and I have seen BERT fail there with resource-exhausted errors. Can we include a parameter such as noise_ratio so we can control the amount of unlabeled data added during training?

thank you,
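The proposed noise_ratio could slot into the batch-building sketch above, something like this (names are again hypothetical):

```python
def n_unlabeled_for(batch_size, noise_ratio=1.0):
    """Number of unlabeled samples appended per batch, as a multiple
    of the labeled batch size; small ratios cap memory growth."""
    return int(batch_size * noise_ratio)

# e.g. batch_size=2 with noise_ratio=2.0 -> 2 labeled + 4 unlabeled
x, y = build_batch(x_labeled, y_labeled, x_unlabeled,
                   batch_size=2, n_unlabeled=n_unlabeled_for(2, 2.0))
```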

bksaini078 commented 3 years ago

One point I would like to mention: in your implementation, do try to get the weights of the student during training. I ran into an issue of not getting the weights when I tried to do it the same way; it kept throwing an error. I am not sure why and am still learning... maybe you know more about that.

thank you

isspek commented 3 years ago

Yes, I have that issue too, and I don't know why. I asked about it on the Keras forum; they may answer, let's see. I will test a few things today to find out whether it happens because of BERT or because of the train_step method. If that doesn't work, I will write my own train-step method based on your code.
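For reference, a custom train step for Mean Teacher typically ends with an EMA update of the teacher. A minimal sketch of that pattern, assuming a subclassed tf.keras.Model whose teacher is a clone of the student (so the variable lists align); this is not the code from either fork:

```python
import tensorflow as tf

class MeanTeacher(tf.keras.Model):
    def __init__(self, student, teacher, alpha=0.99, **kwargs):
        super().__init__(**kwargs)
        self.student, self.teacher, self.alpha = student, teacher, alpha

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            logits = self.student(x, training=True)
            loss = self.compiled_loss(y, logits)
        grads = tape.gradient(loss, self.student.trainable_variables)
        self.optimizer.apply_gradients(
            zip(grads, self.student.trainable_variables))
        # EMA update: teacher <- alpha*teacher + (1 - alpha)*student
        for t, s in zip(self.teacher.trainable_variables,
                        self.student.trainable_variables):
            t.assign(self.alpha * t + (1.0 - self.alpha) * s)
        return {"loss": loss}
```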

isspek commented 3 years ago

I fixed the aforementioned error, but the teacher model's predictions are worse than the student's. Either the weight assignment is wrong, or something else could be the reason. Have you come across such an issue before?

bksaini078 commented 3 years ago

I am not sure, Ipek, but there is something you can try: [screenshot] instead of trainable_weights, try trainable_variables.
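One caveat: on tf.keras layers and models, trainable_variables is documented as an alias of trainable_weights, so swapping them may not change anything by itself. A more common cause of an empty weight list is reading it before the model has been built; a quick check (dummy_batch is a placeholder name):

```python
_ = student(dummy_batch)  # one forward pass builds the variables
# Both properties should now return the same non-empty list.
print(len(student.trainable_weights), len(student.trainable_variables))
```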

bksaini078 commented 3 years ago

One question I have, Ipek, regarding the Augment_data function. [screenshot]

If we append the unlabeled data and do not shuffle after appending, is there not a chance that only the input at index 0 (the labeled data) will have a correct label in every training step? Will that not impact the training?

isspek commented 3 years ago

I am shuffling before that, so I am not sure whether it could have an effect. Currently the student model has better results than plain BERT, but I will try your proposal. I also checked some Mean Teacher implementations, and they use the student for predictions. Maybe a better student model is not a bad thing.

bksaini078 commented 3 years ago

Actually, as per the Mean Teacher paper, both the student and the teacher can be used for prediction, but the teacher will perform better than the student if the weights are assigned properly. It depends on alpha, batch size, epochs, and steps. May I know those details? For example, if alpha is 0.99, the batch size is 1, the training data has 50 samples, and you train for 5-6 epochs, you will not get good teacher predictions.
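Rough intuition for that alpha/step-count interaction: the teacher is an EMA of the student, theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t, whose effective averaging horizon is about 1/(1 - alpha) steps. A back-of-the-envelope check for the numbers quoted above:

```python
alpha = 0.99
steps = 50 * 6          # batch size 1, 50 samples, ~6 epochs
print(1 / (1 - alpha))  # ~100: teacher averages the last ~100 snapshots
print(alpha ** steps)   # ~0.05: weight still on the initial teacher
# With so few total steps, the average is dominated by early, noisy
# student weights, so the teacher can lag behind the student.
```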

isspek commented 3 years ago

Now we achieved a better result with MT compared with DistilBERT alone, around 1.5-2 percentage points more. You can check my forked code; I have updated it. But I only use a batch size of 2 for the labeled samples, and when I increase the batch size, this method throws an exception about encountering a None value: https://github.com/isspek/MT_Bert_FakeNews/blob/289ff45340f3fa4f29de022cd2e272543752df89/src/clf/mt_bert.py#L33 Do you have any idea what the problem could be?

bksaini078 commented 3 years ago

The problem is model.fit(): you will hit that None issue whenever the dataset size is not evenly divisible by the batch size, because the final batch comes up short. For example, 1394 samples work with batch sizes 1, 2, and 17, but not with 3, 4, or 5.
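If that diagnosis is right, one common workaround is to drop the final partial batch in the input pipeline; a sketch assuming a tf.data pipeline, where features, labels, model, batch_size, and epochs are placeholders:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# drop_remainder=True discards the short final batch, so every batch
# reaching train_step has the exact, fixed batch size.
dataset = dataset.shuffle(2048).batch(batch_size, drop_remainder=True)
model.fit(dataset, epochs=epochs)
```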