JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Need guidance to fine-tune BertSentenceEmbeddings using domain-specific sentence pairs #13300

Closed srimantacse closed 1 year ago

srimantacse commented 1 year ago

This is not a feature request as such; rather, I need guidance on building a customized model with BertSentenceEmbeddings on top of a pretrained model, e.g. small_bert_L2_128. I will use a domain-specific dataset to fine-tune that model. Could you share the recommended approach from a Spark NLP perspective?

srimantacse commented 1 year ago

@maziyarpanahi Could you please give input on the query above?

maziyarpanahi commented 1 year ago

Hi,

Fine-tuning of any transformer model must happen outside Spark NLP: the Java APIs for TensorFlow (and the other DL frameworks) do not support fine-tuning; only the Python APIs do. So you pick a model on Hugging Face, say BERT Small (or Tiny, or any size), and fine-tune it for next sentence prediction (NSP) over your own domain-specific dataset; Hugging Face has examples of how to do NSP, and there are many online tutorials. Once you are done, you import that model into the BertSentenceEmbeddings annotator, which you can then use for training classifiers or any other task, at scale and with zero code changes when you go from 1 machine to N machines.
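A minimal sketch of the first step, assuming Hugging Face Transformers and PyTorch: fine-tuning a BERT Tiny checkpoint (the Hub equivalent of small_bert_L2_128) for NSP. The sentence pairs, output path, and hyperparameters are placeholders for your domain data.

```python
# Hypothetical fine-tuning sketch: BERT Tiny (L-2, H-128) for Next Sentence
# Prediction with Hugging Face Transformers. Pairs, paths, and hyperparameters
# are placeholders for domain-specific data.
import torch
from transformers import (
    BertForNextSentencePrediction,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "google/bert_uncased_L-2_H-128_A-2"  # Hub counterpart of small_bert_L2_128
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForNextSentencePrediction.from_pretrained(MODEL_NAME)

# Toy sentence pairs: label 0 = B follows A, label 1 = random pairing
pairs = [
    ("The pump pressure rose sharply.", "Operators initiated a shutdown.", 0),
    ("The pump pressure rose sharply.", "The cafeteria opens at noon.", 1),
]
encodings = tokenizer(
    [a for a, _, _ in pairs],
    [b for _, b, _ in pairs],
    truncation=True,
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([label for _, _, label in pairs])

class NSPDataset(torch.utils.data.Dataset):
    """Wraps the tokenized sentence pairs and NSP labels for the Trainer."""
    def __len__(self):
        return len(labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in encodings.items()}
        item["labels"] = labels[idx]
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./bert-nsp-finetuned", num_train_epochs=1),
    train_dataset=NSPDataset(),
)
trainer.train()

# Save the fine-tuned weights and vocabulary for the import step below
model.save_pretrained("./bert-nsp-finetuned")
tokenizer.save_pretrained("./bert-nsp-finetuned")
```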

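And a sketch of the second step, following Spark NLP's TF SavedModel import flow: convert the fine-tuned checkpoint to TensorFlow, place the vocabulary in the SavedModel's assets folder, and load it with BertSentenceEmbeddings.loadSavedModel. The paths and storage-ref name are placeholders.

```python
# Hypothetical import sketch: convert the fine-tuned checkpoint to a TF
# SavedModel and load it into Spark NLP's BertSentenceEmbeddings annotator.
# Paths and the storage-ref name are placeholders.
import os
from transformers import BertTokenizerFast, TFBertModel

EXPORT_PATH = "./bert-nsp-finetuned"

# Convert PyTorch weights to TensorFlow and export a SavedModel; the NSP head
# is dropped, leaving the base encoder used for embeddings.
tf_model = TFBertModel.from_pretrained(EXPORT_PATH, from_pt=True)
tf_model.save_pretrained(EXPORT_PATH, saved_model=True)

# Spark NLP reads the vocabulary from the SavedModel's assets directory
assets_dir = f"{EXPORT_PATH}/saved_model/1/assets"
os.makedirs(assets_dir, exist_ok=True)
BertTokenizerFast.from_pretrained(EXPORT_PATH).save_vocabulary(assets_dir)

import sparknlp
from sparknlp.annotator import BertSentenceEmbeddings

spark = sparknlp.start()

sent_embeddings = (
    BertSentenceEmbeddings.loadSavedModel(f"{EXPORT_PATH}/saved_model/1", spark)
    .setInputCols(["sentence"])
    .setOutputCol("sentence_embeddings")
    .setDimension(128)  # hidden size of the L-2_H-128 model
    .setStorageRef("sent_small_bert_L2_128_finetuned")  # placeholder ref name
)

# Persist it so it can be reloaded later like any pretrained annotator
sent_embeddings.write().overwrite().save("./sent_bert_finetuned_spark_nlp")
```

From there, the saved annotator can be reloaded with BertSentenceEmbeddings.load and dropped into a Pipeline like any pretrained model.
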
github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.