Filimoa / open-parse

Improved file parsing for LLMs
https://filimoa.github.io/open-parse/
MIT License
2.34k stars · 89 forks

Ollama integration #30

Closed Kydlaw closed 4 months ago

Kydlaw commented 4 months ago

Description

Description of the feature: Provide the ability to use Ollama in SemanticIngestionPipeline (currently it only supports proprietary models).

This way it would be possible to use the semantic parsing without spending money on a proprietary model.

Why the feature should be added to openparse (as opposed to another library or just implemented in your code): The embedding interface already exists in this library (a similar feature), and I'm not aware of a straightforward way to work around the existing code and inject this feature into openparse from the outside.

I can contribute this feature if this interests you.

Filimoa commented 4 months ago

This is a duplicate of #8. You can track progress in #23. The main difficulty is that we currently use a hard-coded similarity cutoff that works well for OpenAI's models, but each embedding model will have its own optimal cutoff. There are a couple of approaches for dealing with this:

1. Start using a percentile cutoff.

This is the approach that LangChain and LlamaIndex use. In my limited testing, finding the optimal percentile is still not trivial, and I found it to perform worse than a hard-coded cutoff.

We could offload choosing this to the user, but the library aims to have opinionated defaults.
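For reference, a percentile-based cutoff along the lines of what LangChain and LlamaIndex do can be sketched as follows. This is a minimal illustration, not openparse's implementation: the similarity scores are toy values, and the choice of the 25th percentile is an assumption.

```python
# Sketch of a percentile cutoff: instead of a fixed threshold, split
# wherever a pair's similarity falls below the k-th percentile of all
# observed pairwise similarities in the document.

def percentile(values, pct):
    """Linearly interpolated percentile of a list of floats."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100) * (len(ordered) - 1)
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    frac = rank - low
    return ordered[low] * (1 - frac) + ordered[high] * frac

# Cosine similarities between consecutive chunk embeddings (toy values).
sims = [0.91, 0.88, 0.35, 0.93, 0.42, 0.89, 0.90]

cutoff = percentile(sims, 25)  # illustrative percentile, not a tuned default
splits = [i for i, s in enumerate(sims) if s < cutoff]  # break points
```

The drawback Filimoa notes applies here: the "right" percentile varies by document and model, so the tuning problem is moved rather than solved.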

2. Figure out the cutoff dynamically

We would generate examples of text that should / shouldn't be combined and use this to figure out a similarity threshold.

```python
# Calibrate the cutoff from example pairs that should be combined.
# `get_similarity` stands in for cosine similarity between embeddings.
similar_pairs = [("very similar text", "continuation"), ...]

similarities = []
for text1, text2 in similar_pairs:
    sim = get_similarity(text1, text2)
    similarities.append(sim)

avg_cutoff = sum(similarities) / len(similarities)
```

While this is kind of dirty, this is the approach I'm currently leaning toward.

Kydlaw commented 4 months ago

I apologize for the duplicate (I didn't see the links to #21 and #23... in #8).

Ok, I see and understand the problem. It is indeed very hard to provide good defaults for that. I'll have a look at your progress in #23 and see if I can suggest something.

I'm closing this issue since it doesn't add anything beyond the existing ones.