NuExtract language model

NatLibFi / FinGreyLit

Data set of Finnish grey literature, containing curated Dublin Core style metadata and links to original PDF publications

18 stars 2 forks source link

NuExtract language model #9

Open juhoinkinen opened 3 days ago

juhoinkinen commented 3 days ago

There exists models finetuned for structured extraction:

NuExtract-tiny, NuExtract, and NuExtract-large

We could try how they perform in comparison to currently used LMs.

osma commented 21 hours ago

Looks very interesting, and it seems they have recently released new 1.5 models claimed to be even better: https://numind.ai/blog/nuextract-1-5---multilingual-infinite-context-still-small-and-better-than-gpt-4o

In practice, these are the new models:

numind/NuExtract-1.5-tiny (based on Qwen2.5-0.5B, 0.5B params)
numind/NuExtract-1.5 (based on Phi-3.5-mini-instruct, 3.8B params)

It appears they didn't release a 7B model in this round.

osma commented 19 hours ago

I looked a bit at the usage code and prompt template that NuExtract uses. It seems to be based on "old-fashioned" completions, not chat-style interaction with messages. That's a bit unfortunate, because our fine-tuned models are based on the chat style, and going back doesn't seem like a very good idea...

Anyway, I guess we need to test it further.