UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine-tune the transformers to financial documents like Invoices? #534

Closed NISH1001 closed 3 years ago

NISH1001 commented 3 years ago

I was experimenting with a few variants of BERT to test for semantics. However, they perform best on the kind of text they were trained on. So I was wondering whether these transformers can be trained on short phrases (like key-value pairs) to compare semantics. Although sentence-transformers' training accepts a list of sentence pairs, I am not sure whether this has been done before...

I have two approaches in mind:

I)

Treat neighbours in the document graph as semantically similar values and train on those pairs. Also add negative samples (far-away nodes).

II)

Instead of neighbours, use the LABEL type (roughly a schema field, say "Seller Name") as the first element of the pair and the corresponding node's value as the second. Treat these as semantically similar sentence pairs and train.

Not sure whether this will work.

Anyway, glad to experiment and tinker more.
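The two approaches above can be sketched as pair-construction code. A minimal sketch; the document graph, node texts, and label names are all made-up toy data, not from any real invoice:

```python
# Approach I: neighbours in the document graph become positive pairs,
# non-neighbours become negatives.
# Approach II: (label type, node text) become positive pairs.

def build_pairs(graph, labels):
    """graph: node text -> list of neighbouring node texts.
    labels: node text -> label type (e.g. "Seller Name")."""
    nodes = list(graph)
    neighbour_pairs = []
    for node, neighbours in graph.items():
        for other in neighbours:
            neighbour_pairs.append((node, other, 1))          # positive pair
        for other in nodes:
            if other != node and other not in neighbours:
                neighbour_pairs.append((node, other, 0))      # negative sample
    label_pairs = [(label, text, 1) for text, label in labels.items()]
    return neighbour_pairs, label_pairs

# Toy invoice graph (illustrative only)
graph = {
    "ACME Corp": ["Seller Name"],
    "Seller Name": ["ACME Corp"],
    "2020-11-05": [],
}
labels = {"ACME Corp": "Seller Name", "2020-11-05": "Invoice Date"}

pos_neg, labelled = build_pairs(graph, labels)
```

Each tuple could then be wrapped in a sentence-transformers training example with its label as the similarity score.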

nreimers commented 3 years ago

Hi @NISH1001 I think a classical classification approach would be more suitable. Here, you could of course use BERT: https://www.sbert.net/examples/training/cross-encoder/README.html

Option 2 might work.
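As a sketch of framing option 2 as the binary-relevance data a CrossEncoder consumes: the field names and values below are made up for illustration; each resulting tuple would become a `sentence_transformers.InputExample` fed to `CrossEncoder.fit`:

```python
# Hypothetical invoice fields (illustrative only)
fields = {
    "Seller Name": "ACME Corp",
    "Invoice Date": "2020-11-05",
    "Total Amount": "1,299.00 EUR",
}

def make_examples(fields):
    """Pair each label type with its true value (label 1.0)
    and with every other field's value (label 0.0)."""
    examples = []
    values = list(fields.values())
    for label, value in fields.items():
        examples.append(((label, value), 1.0))       # matching pair
        for other in values:
            if other != value:
                examples.append(((label, other), 0.0))  # mismatched pair
    return examples

examples = make_examples(fields)
```

The negatives here are just the other fields on the same document; harder negatives (similar-looking values from other documents) would likely help.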

NISH1001 commented 3 years ago

Hi @nreimers

Thanks for the reply. Do you think we can make use of the bounding-box positions of the words anywhere in the training process? My goal is to build some kind of "semantic scorer" between two words in a document, which might help in downstream tasks (say, information extraction).

I guess that's not possible with these LMs alone... I tried fine-tuning distilroberta-base-msmarco-v1, but the performance got worse...

nreimers commented 3 years ago

Hi @NISH1001 Yes, I think information about the position on the invoice can be helpful. But I think this framework is the wrong choice here.

So what you can do is: take the content plus its position on the invoice (top-left, bottom, etc.) and pass it to a classifier (BERT, a neural net, or XGBoost) to classify it into the categories you have (date, sender, recipient, invoice number).

You might also want to add some global information on top of it, e.g. sender and recipient information is only available from a global perspective.
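The content-plus-position idea can be sketched as a simple featurizer whose output would feed whichever classifier you pick (XGBoost, a neural net, ...). The coordinate convention and the specific features are assumptions for illustration:

```python
import re

def featurize(token, bbox, page_w, page_h):
    """Combine a token's content with its normalised page position.
    bbox is assumed to be (x0, y0, x1, y1) in page coordinates,
    origin at the top-left."""
    x0, y0, x1, y1 = bbox
    return [
        (x0 + x1) / (2 * page_w),                        # horizontal centre, 0 = left
        (y0 + y1) / (2 * page_h),                        # vertical centre, 0 = top
        1.0 if re.search(r"\d", token) else 0.0,         # contains digits (dates, amounts)
        1.0 if re.search(r"[A-Za-z]", token) else 0.0,   # contains letters (names)
        len(token) / 40.0,                               # crude length feature
    ]

# e.g. an amount near the bottom-right of an A4-sized page (595 x 842 pt)
vec = featurize("1,299.00", (400, 700, 480, 715), 595, 842)
```

In practice the text part would be a BERT embedding rather than these hand-made flags, with the position features concatenated on.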

NISH1001 commented 3 years ago

@nreimers Thanks. I think I will try BERT embeddings + position encodings with a siamese network and see how it goes. I will close this issue; I just wanted to have some discussion. Also, having tried GNNs, I am not impressed by their performance either: they work well on balanced data, but on documents with many background nodes they simply didn't work. In my previous experiments I fed flattened GNN-like features to LightGBM, along with fastText embeddings, position features, and custom textual features. Anyway, I will try hooking BERT up to a simpler network and see if it performs well.
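A minimal, library-free sketch of the kind of "semantic scorer" described above: it blends text-embedding similarity with a spatial-closeness term. The embeddings, positions, and the `alpha` weighting are illustrative assumptions; a real siamese network would learn this combination rather than use a fixed formula:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def semantic_score(emb_a, pos_a, emb_b, pos_b, alpha=0.5):
    """Score two document tokens given (embedding, position) pairs.
    alpha trades off textual vs spatial similarity (made-up weighting)."""
    text_sim = cosine(emb_a, emb_b)
    # Spatial closeness: 1.0 at identical positions, decaying with distance.
    spatial_sim = 1.0 / (1.0 + math.dist(pos_a, pos_b))
    return alpha * text_sim + (1 - alpha) * spatial_sim

# Identical tokens at the same position score 1.0; orthogonal embeddings
# far apart score low.
same = semantic_score([1.0, 0.0], (0, 0), [1.0, 0.0], (0, 0))
far = semantic_score([1.0, 0.0], (0, 0), [0.0, 1.0], (0, 3))
```

The toy `cosine` and distance terms stand in for what the siamese branches would produce after training.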

Again, thanks for the reply.