Closed emaadmanzoor closed 4 years ago
We're glad you're using it :)
We simply trained for as long as the loss decreased, using all the data. Overfitting didn't seem to be an issue for us in practice; I believe that matches the usual experience with these very deep neural nets.
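That stopping rule ("train for as long as the loss decreased") can be sketched as a small patience-based loop. This is a minimal illustration, not code from the repo; the loss values are synthetic stand-ins for per-epoch losses.

```python
def train_until_plateau(losses, patience=2):
    """Return the epoch with the best loss, stopping once the loss
    has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch

# Synthetic losses that plateau after epoch 3:
print(train_until_plateau([1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.38]))  # → 3
```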
Note that most of our fine tuning uses both the supervised objective and the (masked words) unsupervised objective. This serves to regularize the model.
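Concretely, "both objectives" just means the total loss is the supervised term plus the masked-words term, with some weighting. A minimal sketch, where `alpha` and the loss values are illustrative placeholders rather than numbers from the paper:

```python
def joint_loss(supervised_loss, masked_lm_loss, alpha=1.0):
    """Combined objective: the masked-LM (unsupervised) term acts as a
    regularizer on top of the supervised prediction loss."""
    return supervised_loss + alpha * masked_lm_loss

# Example with placeholder loss values and a down-weighted MLM term:
print(joint_loss(0.5, 2.0, alpha=0.5))  # → 1.5
```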
Another note: folk wisdom holds that you should first train the model with the pure unsupervised objective to convergence on your own data before you do the supervised fine-tuning. IIRC, that also helped for us.
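The schedule described above has two phases: unsupervised-only training on the target corpus first, then the joint supervised objective. A toy sketch of that ordering, where `mlm_step` and `joint_step` are hypothetical stand-ins for real training steps:

```python
def run_schedule(mlm_step, joint_step, mlm_steps=3, joint_steps=2):
    """Phase 1: masked-LM objective only; phase 2: joint objective."""
    log = []
    for _ in range(mlm_steps):
        log.append(("mlm", mlm_step()))      # unsupervised only
    for _ in range(joint_steps):
        log.append(("joint", joint_step()))  # supervised + masked-LM
    return log

# Synthetic decreasing losses, shared across both phases for illustration:
losses = iter([2.0, 1.5, 1.2, 0.9, 0.8])
step = lambda: next(losses)
print(run_schedule(step, step))
```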
We'll be interested to see where you get!
This addresses my concern, thank you!
Hi, thank you for open-sourcing your work!
I am working on applying your technique to adjust for textual confounders. My dataset consists of text pairs from conversations between a customer service agent and a client. My (ordinal) treatment is the customer service agent's average rating, and my outcome is binary (client satisfaction). I understand that your theory may not apply to ordinal treatments, but I am interested in the method nevertheless (to compare with LDA).
I understand that I need to fine-tune BERT to predict (i) the customer service agent's average rating, and (ii) the conversation outcome.
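The setup described above is a shared text representation with two prediction heads: one for the treatment (the agent's average rating) and one for the outcome (client satisfaction). A minimal sketch, where the 2-dimensional "embedding" and the linear heads are illustrative placeholders for BERT and its task heads:

```python
def two_head_predict(embedding, w_treatment, w_outcome):
    """Shared representation, two heads: a score for the ordinal
    treatment and a thresholded binary outcome."""
    rating = sum(e * w for e, w in zip(embedding, w_treatment))
    satisfied = sum(e * w for e, w in zip(embedding, w_outcome)) > 0
    return rating, satisfied

# Placeholder embedding and head weights:
print(two_head_predict([1.0, -0.5], [2.0, 1.0], [0.5, 2.0]))  # → (1.5, False)
```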
However, I could not find guidelines in the paper on how to select the hyperparameters and regularize the fine-tuning. If I simply train for a large number of epochs, I believe I can overfit the training data and achieve zero loss on both prediction objectives. Is this desirable, or should I hold out a validation subset and select the model with the lowest overall validation loss?
I am also concerned about the embeddings drifting away from their pretrained values (which are a good representation of language) towards values that help predict the two outcomes but no longer represent language well.
Thanks!