MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.2k stars, 146 forks

Training combined model on Databricks hangs #90

Closed: GullyBurns closed this issue 2 years ago

GullyBurns commented 3 years ago

Description

Trying to run the basic CTM demo for the combined TM from this Colab notebook: https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing#scrollTo=stAb2Q4eBB3W

ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50, num_epochs=20)
ctm.fit(training_dataset)  # run the model

What I Did

This is the screenshot (from Databricks); it has not changed for ~30 minutes.

[Screenshot: Databricks training output, unchanged for ~30 minutes]

Running this on the standard DBpedia sample, training just hangs at a specific epoch: https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt

I expected later epochs to take roughly as long as earlier ones, so when one epoch takes much longer, it looks like the process is blocked rather than just slow.
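The reasoning above (an epoch that runs far longer than the earlier ones is more likely blocked than slow) can be written as a simple check. This is a minimal sketch, not part of CTM: the name `looks_stalled`, the 3× median threshold and the two-epoch minimum history are arbitrary assumptions.

```python
from statistics import median


def looks_stalled(epoch_seconds, elapsed_in_current, factor=3.0, min_history=2):
    """Hypothetical helper: return True if the current epoch has already run
    `factor` times longer than the median of the completed epochs.

    epoch_seconds: durations (seconds) of epochs that already finished.
    elapsed_in_current: seconds spent so far in the epoch still running.
    """
    if len(epoch_seconds) < min_history:
        # Too little history to know what a "normal" epoch looks like.
        return False
    typical = median(epoch_seconds)
    return elapsed_in_current > factor * typical
```

For example, with completed epochs of 40, 42 and 41 seconds, `looks_stalled([40, 42, 41], 1800)` returns `True` after 30 minutes in the current epoch, while `looks_stalled([40, 42, 41], 90)` returns `False`. Using the median rather than the mean keeps a single unusually slow epoch (such as a warm-up first epoch) from inflating the baseline.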

vinid commented 3 years ago

Hi @GullyBurns!

I am not really familiar with Databricks. Is there a way for me to test it?

It's weird, because the model runs smoothly both locally and on Google Colab.

GullyBurns commented 3 years ago

Hey Federico,

Our implementation is heavily firewalled, so there's no way for you to tinker with it directly. Is there any way to set logging parameters in the source code to get some diagnostics?

I can ask some of our local Databricks experts to take a look and see what might be going on.

But the basic demo on the DBpedia data should train in a few minutes, right?

Gul

vinid commented 3 years ago

There's no logging implemented yet, but I can probably work on this and add some diagnostics.

Yes, it should only take a few minutes to complete. The very same demo runs in a few minutes on Google Colab, which is why I can't really understand what's happening.
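Until the library has its own logging, one way to get diagnostics from a hang is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. This is a hedged sketch: `hang_watchdog` is a hypothetical helper, and the timeout and log file are placeholders. Writing to an ordinary file is assumed to be safer than a notebook's `sys.stderr`, which may not expose a usable file descriptor.

```python
import contextlib
import faulthandler


@contextlib.contextmanager
def hang_watchdog(timeout_s, log_file):
    """Hypothetical helper: if the wrapped block is still running after
    `timeout_s` seconds, dump the stack of every thread to `log_file`.
    The block itself is not interrupted."""
    # Arm a one-shot timer that runs in a background thread.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)
    try:
        yield
    finally:
        # Disarm the timer so a block that finishes in time writes nothing.
        faulthandler.cancel_dump_traceback_later()
```

For example, `with open("ctm_hang.log", "w") as log, hang_watchdog(15 * 60, log): ctm.fit(training_dataset)` would write stack traces to `ctm_hang.log` if `fit` is still running after 15 minutes. Only the threads of the main Python process are dumped; DataLoader worker subprocesses are separate processes, so at best the dump shows where the main process is waiting on them.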

tuulia commented 3 years ago

Hello. I had the same problem on Databricks: the model stopped working at random after some epochs. What worked for me was setting the number of DataLoader workers to 0.
Tuulia

GullyBurns commented 2 years ago

We just got confirmation from the Databricks folks that this is the solution:

ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50, num_epochs=20, num_data_loader_workers=0)
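With `num_data_loader_workers=0`, batches are presumably loaded by PyTorch's `DataLoader` in the main process instead of in worker subprocesses, which sidesteps whatever makes the workers stall on Databricks. If the same code also runs elsewhere, the worker count could be chosen per environment. This is a sketch under assumptions: `pick_num_data_loader_workers` is hypothetical, it detects Databricks through the `DATABRICKS_RUNTIME_VERSION` environment variable, and the fallback of one worker per CPU mirrors what appears to be CTM's default (check the installed version).

```python
import multiprocessing
import os


def pick_num_data_loader_workers(env=None):
    """Hypothetical helper: return 0 on Databricks, else one worker per CPU.

    Assumption: Databricks clusters set DATABRICKS_RUNTIME_VERSION. This is
    a heuristic, not something CTM does itself.
    """
    env = os.environ if env is None else env
    if "DATABRICKS_RUNTIME_VERSION" in env:
        # 0 workers: batches are loaded in the main process (the fix reported above).
        return 0
    # Assumed to match CTM's default worker count.
    return multiprocessing.cpu_count()
```

It would then be passed as `CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50, num_epochs=20, num_data_loader_workers=pick_num_data_loader_workers())`.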