MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.17k stars 764 forks

Issue using Flair TransformerDocumentEmbeddings #1145

Closed MarkWClements closed 1 year ago

MarkWClements commented 1 year ago

I am trying to use the FinBERT model with BERTopic, and I've read the docs on how to create document embeddings using flair. However, I am running into an issue that I can't figure out. Here is my sample code to reproduce the error:

from flair.embeddings import TransformerDocumentEmbeddings
from flair.data import Sentence

model_object = TransformerDocumentEmbeddings('finbert', cls_pooling='cls')

# Raw input text
text = """Thanks, Simon. One of the things that we should be aware when we're looking at the financial results that you might remember, we sold our M&A business at -- in the June '22 financial results. And so PPP does not include M&A in the financial results. I'll repeat the results in terms of revenue, EBITDA, et cetera, and then I'll go through some reasons in relation to that. The revenue as Simon said, was up at 9.29% at AUD111 million. EBITDA was down to AUD26.1 million and patent was down AUD8.3 million and gross operating cash flow, which is probably one of our major metrics, was negative 8.9% compared to PCP of 6.68%. So I'll talk about GOCF in a minute. But going through the segment results, in terms of that revenue growth at 9.29%. The PI segment actually grew by 10%, and it grew in Queensland PI. The new offices that we opened last year, the 4 new offices by [ sole ] contribution, increased contribution from Stephen Browne and our Sciaccas businesses. In terms of the NPA segment, it only grew at 5.9% this period, and that was really due to growth in our revenues contributed by our medical law and commercial dispute practice. So, while revenue was up 9.29%. EBITDA growth was flat and marginally below PCP. If you look at the segment PI segment, its EBITDA margin dropped temporarily to 22.6% from 25.5%. And that's really due to extra provisioning that will require because we didn't sell as many cases to the first half or with group and therefore, we're required to provide positioning in respect o those -- that growth in that with -- in addition, expenses were higher in this half compared to PCP. We believe that margin will normalize once case resolution activity increases and expect are normalized in the near future once we're out with this growth phase.
In terms of the NPA segment, EBITDA margin dropped temporarily from 27.1% from [ 33.2% ] Again, we're investigating a whole bunch of new class actions, and we're required under the revenue standard, we're required to 100% provide all our class action investigations. Again, from a couple of views, we expect that margin to normalize once case resolution activity increases in class actions and also expenses are controlled in the near future once we're out of our growth phase."""

# Create flair sentence
sentence = Sentence(text)

# Embed flair sentence
embedding = model_object.embed(sentence)

When I run this code, the last line, which creates the embedding via model_object.embed(sentence), produces this error:

RuntimeError: The expanded size of the tensor (520) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 520]. Tensor sizes: [1, 512]

But when I look at the tokens in the Sentence that flair creates, I see that there are only 465 of them. Running

sentence.tokens

produces this

[Token[0]: "Thanks", Token[1]: ",", Token[2]: "Simon", Token[3]: ".", Token[4]: "One", Token[5]: "of", Token[6]: "the", Token[7]: "things", Token[8]: "that", Token[9]: "we", Token[10]: "should", Token[11]: "be", Token[12]: "aware", Token[13]: "when", Token[14]: "we", Token[15]: "'re", Token[16]: "looking", Token[17]: "at", Token[18]: "the", Token[19]: "financial", Token[20]: "results", Token[21]: "that", Token[22]: "you", Token[23]: "might", Token[24]: "remember", Token[25]: ",", Token[26]: "we", Token[27]: "sold", Token[28]: "our", Token[29]: "M", Token[30]: "&", Token[31]: "A", Token[32]: "business", Token[33]: "at", Token[34]: "--", Token[35]: "in", Token[36]: "the", Token[37]: "June", Token[38]: "'", Token[39]: "22", Token[40]: "financial", Token[41]: "results", Token[42]: ".", Token[43]: "And", Token[44]: "so", Token[45]: "PPP", Token[46]: "does", Token[47]: "not", Token[48]: "include", Token[49]: "M", Token[50]: "&", Token[51]: "A", Token[52]: "in", Token[53]: "the", Token[54]: "financial", Token[55]: "results", Token[56]: ".", Token[57]: "I", Token[58]: "'ll", Token[59]: "repeat", Token[60]: "the", Token[61]: "results", Token[62]: "in", Token[63]: "terms", Token[64]: "of", Token[65]: "revenue", Token[66]: ",", Token[67]: "EBITDA", Token[68]: ",", Token[69]: "et", Token[70]: "cetera", Token[71]: ",", Token[72]: "and", Token[73]: "then", Token[74]: "I", Token[75]: "'ll", Token[76]: "go", Token[77]: "through", Token[78]: "some", Token[79]: "reasons", Token[80]: "in", Token[81]: "relation", Token[82]: "to", Token[83]: "that", Token[84]: ".", Token[85]: "The", Token[86]: "revenue", Token[87]: "as", Token[88]: "Simon", Token[89]: "said", Token[90]: ",", Token[91]: "was", Token[92]: "up", Token[93]: "at", Token[94]: "9.29", Token[95]: "%", Token[96]: "at", Token[97]: "AUD111", Token[98]: "million", Token[99]: ".", Token[100]: "EBITDA", Token[101]: "was", Token[102]: "down", Token[103]: "to", Token[104]: "AUD26.1", Token[105]: "million", Token[106]: "and", 
Token[107]: "patent", Token[108]: "was", Token[109]: "down", Token[110]: "AUD8.3", Token[111]: "million", Token[112]: "and", Token[113]: "gross", Token[114]: "operating", Token[115]: "cash", Token[116]: "flow", Token[117]: ",", Token[118]: "which", Token[119]: "is", Token[120]: "probably", Token[121]: "one", Token[122]: "of", Token[123]: "our", Token[124]: "major", Token[125]: "metrics", Token[126]: ",", Token[127]: "was", Token[128]: "negative", Token[129]: "8.9", Token[130]: "%", Token[131]: "compared", Token[132]: "to", Token[133]: "PCP", Token[134]: "of", Token[135]: "6.68", Token[136]: "%", Token[137]: ".", Token[138]: "So", Token[139]: "I", Token[140]: "'ll", Token[141]: "talk", Token[142]: "about", Token[143]: "GOCF", Token[144]: "in", Token[145]: "a", Token[146]: "minute", Token[147]: ".", Token[148]: "But", Token[149]: "going", Token[150]: "through", Token[151]: "the", Token[152]: "segment", Token[153]: "results", Token[154]: ",", Token[155]: "in", Token[156]: "terms", Token[157]: "of", Token[158]: "that", Token[159]: "revenue", Token[160]: "growth", Token[161]: "at", Token[162]: "9.29", Token[163]: "%", Token[164]: ".", Token[165]: "The", Token[166]: "PI", Token[167]: "segment", Token[168]: "actually", Token[169]: "grew", Token[170]: "by", Token[171]: "10", Token[172]: "%", Token[173]: ",", Token[174]: "and", Token[175]: "it", Token[176]: "grew", Token[177]: "in", Token[178]: "Queensland", Token[179]: "PI", Token[180]: ".", Token[181]: "The", Token[182]: "new", Token[183]: "offices", Token[184]: "that", Token[185]: "we", Token[186]: "opened", Token[187]: "last", Token[188]: "year", Token[189]: ",", Token[190]: "the", Token[191]: "4", Token[192]: "new", Token[193]: "offices", Token[194]: "by", Token[195]: "[", Token[196]: "sole", Token[197]: "]", Token[198]: "contribution", Token[199]: ",", Token[200]: "increased", Token[201]: "contribution", Token[202]: "from", Token[203]: "Stephen", Token[204]: "Browne", Token[205]: "and", Token[206]: "our", Token[207]: 
"Sciaccas", Token[208]: "businesses", Token[209]: ".", Token[210]: "In", Token[211]: "terms", Token[212]: "of", Token[213]: "the", Token[214]: "NPA", Token[215]: "segment", Token[216]: ",", Token[217]: "it", Token[218]: "only", Token[219]: "grew", Token[220]: "at", Token[221]: "5.9", Token[222]: "%", Token[223]: "this", Token[224]: "period", Token[225]: ",", Token[226]: "and", Token[227]: "that", Token[228]: "was", Token[229]: "really", Token[230]: "due", Token[231]: "to", Token[232]: "growth", Token[233]: "in", Token[234]: "our", Token[235]: "revenues", Token[236]: "contributed", Token[237]: "by", Token[238]: "our", Token[239]: "medical", Token[240]: "law", Token[241]: "and", Token[242]: "commercial", Token[243]: "dispute", Token[244]: "practice", Token[245]: ".", Token[246]: "So", Token[247]: ",", Token[248]: "while", Token[249]: "revenue", Token[250]: "was", Token[251]: "up", Token[252]: "9.29", Token[253]: "%", Token[254]: ".", Token[255]: "EBITDA", Token[256]: "growth", Token[257]: "was", Token[258]: "flat", Token[259]: "and", Token[260]: "marginally", Token[261]: "below", Token[262]: "PCP", Token[263]: ".", Token[264]: "If", Token[265]: "you", Token[266]: "look", Token[267]: "at", Token[268]: "the", Token[269]: "segment", Token[270]: "PI", Token[271]: "segment", Token[272]: ",", Token[273]: "its", Token[274]: "EBITDA", Token[275]: "margin", Token[276]: "dropped", Token[277]: "temporarily", Token[278]: "to", Token[279]: "22.6", Token[280]: "%", Token[281]: "from", Token[282]: "25.5", Token[283]: "%", Token[284]: ".", Token[285]: "And", Token[286]: "that", Token[287]: "'s", Token[288]: "really", Token[289]: "due", Token[290]: "to", Token[291]: "extra", Token[292]: "provisioning", Token[293]: "that", Token[294]: "will", Token[295]: "require", Token[296]: "because", Token[297]: "we", Token[298]: "did", Token[299]: "n't", Token[300]: "sell", Token[301]: "as", Token[302]: "many", Token[303]: "cases", Token[304]: "to", Token[305]: "the", Token[306]: "first", 
Token[307]: "half", Token[308]: "or", Token[309]: "with", Token[310]: "group", Token[311]: "and", Token[312]: "therefore", Token[313]: ",", Token[314]: "we", Token[315]: "'re", Token[316]: "required", Token[317]: "to", Token[318]: "provide", Token[319]: "positioning", Token[320]: "in", Token[321]: "respect", Token[322]: "o", Token[323]: "those", Token[324]: "--", Token[325]: "that", Token[326]: "growth", Token[327]: "in", Token[328]: "that", Token[329]: "with", Token[330]: "--", Token[331]: "in", Token[332]: "addition", Token[333]: ",", Token[334]: "expenses", Token[335]: "were", Token[336]: "higher", Token[337]: "in", Token[338]: "this", Token[339]: "half", Token[340]: "compared", Token[341]: "to", Token[342]: "PCP", Token[343]: ".", Token[344]: "We", Token[345]: "believe", Token[346]: "that", Token[347]: "margin", Token[348]: "will", Token[349]: "normalize", Token[350]: "once", Token[351]: "case", Token[352]: "resolution", Token[353]: "activity", Token[354]: "increases", Token[355]: "and", Token[356]: "expect", Token[357]: "are", Token[358]: "normalized", Token[359]: "in", Token[360]: "the", Token[361]: "near", Token[362]: "future", Token[363]: "once", Token[364]: "we", Token[365]: "'re", Token[366]: "out", Token[367]: "with", Token[368]: "this", Token[369]: "growth", Token[370]: "phase.Â", Token[371]: "In", Token[372]: "terms", Token[373]: "of", Token[374]: "the", Token[375]: "NPA", Token[376]: "segment", Token[377]: ",", Token[378]: "EBITDA", Token[379]: "margin", Token[380]: "dropped", Token[381]: "temporarily", Token[382]: "from", Token[383]: "27.1", Token[384]: "%", Token[385]: "from", Token[386]: "[", Token[387]: "33.2", Token[388]: "%", Token[389]: "]", Token[390]: "Again", Token[391]: ",", Token[392]: "we", Token[393]: "'re", Token[394]: "investigating", Token[395]: "a", Token[396]: "whole", Token[397]: "bunch", Token[398]: "of", Token[399]: "new", Token[400]: "class", Token[401]: "actions", Token[402]: ",", Token[403]: "and", Token[404]: "we", 
Token[405]: "'re", Token[406]: "required", Token[407]: "under", Token[408]: "the", Token[409]: "revenue", Token[410]: "standard", Token[411]: ",", Token[412]: "we", Token[413]: "'re", Token[414]: "required", Token[415]: "to", Token[416]: "100", Token[417]: "%", Token[418]: "provide", Token[419]: "all", Token[420]: "our", Token[421]: "class", Token[422]: "action", Token[423]: "investigations", Token[424]: ".", Token[425]: "Again", Token[426]: ",", Token[427]: "from", Token[428]: "a", Token[429]: "couple", Token[430]: "of", Token[431]: "views", Token[432]: ",", Token[433]: "we", Token[434]: "expect", Token[435]: "that", Token[436]: "margin", Token[437]: "to", Token[438]: "normalize", Token[439]: "once", Token[440]: "case", Token[441]: "resolution", Token[442]: "activity", Token[443]: "increases", Token[444]: "in", Token[445]: "class", Token[446]: "actions", Token[447]: "and", Token[448]: "also", Token[449]: "expenses", Token[450]: "are", Token[451]: "controlled", Token[452]: "in", Token[453]: "the", Token[454]: "near", Token[455]: "future", Token[456]: "once", Token[457]: "we", Token[458]: "'re", Token[459]: "out", Token[460]: "of", Token[461]: "our", Token[462]: "growth", Token[463]: "phase", Token[464]: "."]
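A note on the count mismatch (my reading of it, not confirmed in this thread): flair's sentence.tokens are word-level tokens, but the underlying transformer re-tokenizes each word into subword pieces and adds special tokens like [CLS] and [SEP], so 465 words can expand to 520 positions and overflow the model's 512-position limit, which matches the error above. Here is a minimal, self-contained sketch of cutting a subword-id sequence down to the limit while keeping the final separator; truncate_ids is a hypothetical helper, shown only to illustrate the 520 -> 512 cut (real tokenizers, e.g. Hugging Face's, handle this via truncation=True / max_length):

```python
def truncate_ids(ids, max_len=512):
    """Truncate a subword-id sequence to max_len positions.

    Hypothetical helper for illustration: drops pieces from the middle/end
    but keeps the final id (the [SEP] token in BERT-style models).
    """
    if len(ids) <= max_len:
        return ids
    return ids[: max_len - 1] + [ids[-1]]  # keep the final separator id

# 520 fake subword ids, as in the error message above
ids = list(range(520))
print(len(truncate_ids(ids)))  # 512
```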

I've tried googling the cause of this error, but there isn't much out there, and the flair documentation isn't great, so I was hoping you could help.

Also, is there a way to use a different tokenizer here? The default tokenizer splits punctuation into separate tokens, which I assume isn't ideal for the FinBERT document embeddings.
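One common workaround for over-length documents, sketched here under assumptions (not a confirmed fix from this thread), is to split the text into chunks that each fit within the 512-token limit, embed each chunk, and average the per-chunk vectors into one document vector. embed_fn below stands in for any chunk-level embedder, e.g. a small wrapper around model_object.embed that returns the sentence's embedding as a list of floats:

```python
def embed_long_document(chunks, embed_fn):
    """Average per-chunk vectors into one document vector.

    chunks:   list of text chunks, each short enough for the model
    embed_fn: hypothetical callable mapping a chunk -> list of floats
    """
    vectors = [embed_fn(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy stand-in embedder for demonstration: (char count, word count)
demo = lambda chunk: [len(chunk), len(chunk.split())]
doc_vec = embed_long_document(["short chunk", "another chunk here"], demo)
print(doc_vec)  # [14.5, 2.5]
```

Averaging loses some information relative to a single full-document pass, but it keeps every part of the transcript in the representation instead of silently dropping the tail.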

Thank You!

MaartenGr commented 1 year ago

Have you also tried posting this at the Flair repo? They are typically quite fast in answering any issues and are significantly more knowledgeable about that package than I am. Also, I think you can find a similar issue here.

MarkWClements commented 1 year ago

@MaartenGr I will try posting to the flair repo. I did run across the post you shared with a similar issue but the "fix" discussed in that post does not work for me.

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!