Closed syGOAT closed 7 months ago
For testing, I set n_clusters=5 in KMeans and passed in only 20 docs; that ran in just a few seconds. But with ZeroShotClassification over my full set of candidate topics, the code ran for more than 20 minutes without finishing, which is far too long. When I set only 3 candidate topics (ZeroShotClassification(candidate_topics[:3], model="./bart-large-mnli")), the code finished after 4 minutes.
It is difficult to say whether that is fast or slow without knowing a bit more about your environment. Are you using a GPU? If not, that might explain the problem you are facing.
So the problem may be that there are too many candidate topics, but I don't think 70 is a lot. Maybe you could optimize the parallel batch processing, if the transformers pipeline itself is not the problem?
Most likely, the compute time is a result of the transformers pipeline, and parallel batch processing is a bit more involved when you are dealing with GPUs. Generally, I think this should be reasonably fast if you are using a GPU.
@MaartenGr Thank you for the reply! I did use a GPU. I think this is where the problem lies:
https://github.com/MaartenGr/BERTopic/blob/6c9eb6e72a881077ac59c35752d26e391bfe4c49/bertopic/representation/_zeroshot.py#L72C9-L74C104
topic_descriptions had 100 topics in my case, with a lot of words, and it was passed directly to the ZeroShotClassificationPipeline without batching. I think it was the large amount of data processed by the pipeline in one call that led to the long running time. Maybe adding a batch_size argument to ZeroShotClassification would be better?
Which GPU are you using? The specific model can make quite a difference.
> I think it was the large amount of data processed by the pipeline in one call that led to the long running time. Maybe adding a batch_size argument to ZeroShotClassification would be better?
That could be a possibility, but have you tested whether that is indeed the issue? You can check by adapting the code you referenced there to supply custom batches yourself. Part of BERTopic's design is its modularity, so adapting it should be straightforward.
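For instance, something along these lines might work (a rough sketch, not tested; the chunk size and the placeholder inputs below are mine, not BERTopic's actual internals):

```python
from transformers import pipeline

# Stand-alone version of the zero-shot step, fed in chunks.
# device=0 assumes a GPU; adjust as needed.
classifier = pipeline("zero-shot-classification", model="./bart-large-mnli", device=0)

candidate_topics = ["sports", "politics", "technology"]          # placeholder labels
topic_descriptions = ["goal league match", "election vote law"]  # placeholder inputs

chunk_size = 8
classifications = []
for i in range(0, len(topic_descriptions), chunk_size):
    chunk = topic_descriptions[i:i + chunk_size]
    out = classifier(chunk, candidate_topics)
    # some transformers versions return a single dict for a one-element input
    classifications.extend(out if isinstance(out, list) else [out])
```

Timing this against a single call over all topic descriptions should tell you whether batching is really the bottleneck.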
@MaartenGr I found the problem. The __call__ method of the Pipeline class in transformers has a batch_size argument:
https://github.com/huggingface/transformers/blob/caa5c65db1f4db617cdac2ad667ba62edf94dd98/src/transformers/pipelines/base.py#L1157C3-L1170C46
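For example (the sequences and labels below are hypothetical, just to illustrate the call):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="./bart-large-mnli")

sequences = ["first topic keywords", "second topic keywords"]  # placeholder inputs
candidate_labels = ["sports", "politics"]                      # placeholder labels

# batch_size is handled by Pipeline.__call__, so the sequences are run
# through the model in batches of 16 instead of all in one forward pass
outputs = classifier(sequences, candidate_labels, batch_size=16)
```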
We could consider adding it to pipeline_kwargs:
https://github.com/MaartenGr/BERTopic/blob/de7376d3d42960e787a1115e5fe69fb726a7a33d/bertopic/representation/_zeroshot.py#L55C1-L74C104
Ah, in that case it is already implemented, right? Just do something like this:
representation_model = ZeroShotClassification(candidate_topics, model="./bart-large-mnli", pipeline_kwargs={"batch_size":32})
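In the context of your KMeans test, that would look roughly like this (a sketch; the documents and candidate topics below are placeholders):

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic
from bertopic.representation import ZeroShotClassification

docs = [f"example document {i}" for i in range(20)]      # placeholder documents
candidate_topics = ["sports", "politics", "technology"]  # placeholder topics

representation_model = ZeroShotClassification(
    candidate_topics,
    model="./bart-large-mnli",
    pipeline_kwargs={"batch_size": 32},  # forwarded to the transformers pipeline call
)

topic_model = BERTopic(
    hdbscan_model=KMeans(n_clusters=5),  # KMeans instead of HDBSCAN, as in the test
    representation_model=representation_model,
)
topics, probs = topic_model.fit_transform(docs)
```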
@MaartenGr Yes. The problem has been solved. Thank you so much!