Better navigation, stop/pause button, updated transformers.js, threshold input.

varunneal commented 1 year ago

No need to merge if unwanted. I kept to the style goal of sticking to a single index.html file.

varunneal commented 1 year ago

Here's the live site on GitHub Pages: https://varunnsrivastava.github.io/SemanticFinder/

do-me commented 1 year ago

This is so awesome, great work! Love the sidebar with the clickable results and it's even mobile-friendly 🎉 The threshold feature is fantastic to play around with!

Will merge it soon - first, I want to set up GitHub pages for the repo too (facilitates the workflow and deployment) and remove the site from my personal homepage.

Some very minor (mostly CSS-related) things:

- a small margin/padding is needed between the CodeMirror textarea and the buttons
- progress number and bar should be on one line if possible
- progress bar should be full textarea width (seems like some conflicting CSS classes or flexbox stuff)
- semantically, Bootstrap differentiates between primary and secondary buttons. Submit should be primary (the dark blue color nudging the user to click it) while the prev and next buttons should be secondary (just a blue outline). This guides the user better.
- with two parameters (chars & threshold), the user might be overwhelmed, so I guess we can't avoid a simple explanation, maybe below the app for curious users
- a heading "Results" on the right would be useful, else one might wonder why the textarea is not centered

If you find a spare minute, feel free to modify any of the bullet points.

Update: Merged meanwhile.

varunneal commented 1 year ago

Thanks for looking at my code! I'll take a look at these bullet points. My broader goal, which I've already spent a few hours on, is to do automatic semantic segmentation in the browser. That is, the parsing is completely automatic, with highlights based on textually relevant phrases. For that, it should be possible to drop "# chars" as a user-controlled parameter.

I was finding a floating "Results" title ugly but feel free to add it. There should already be a bootstrap column where it can go.

do-me commented 1 year ago

Fantastic!

For the "automatic segmentation" part I have a few links - but unfortunately no definitive answer. It's somewhat dependent on what you're looking for, how long your input text is, how much time you have and what your aim is (find keywords or paragraphs?).

There is e.g. langchain JS with RecursiveCharacterTextSplitter that could come in handy.

Else, just to compare how other communities deal with it, there is haystack in Python.

It pretty much boils down to finding some kind of boundary (paragraph, sentence, word, character) and splitting there. If a segment produced by one boundary still exceeds the input length of the model, the next finer boundary is used to reduce the segment length. Is that what you had in mind?
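To make the coarse-to-fine idea concrete, here is a minimal sketch in plain JavaScript. It is not the langchain or haystack implementation: the `SEPARATORS` list and the `splitRecursively` helper are hypothetical names chosen for illustration, lengths are counted in characters rather than model tokens, and the separator is dropped wherever a split happens.

```javascript
// Boundaries ordered from coarse to fine: paragraph, sentence, word, character.
// These separators are illustrative assumptions, not taken from any library.
const SEPARATORS = ["\n\n", ". ", " ", ""];

// Hypothetical helper: split `text` into chunks of at most `maxLen` characters,
// always preferring the coarsest boundary that still fits.
function splitRecursively(text, maxLen, level = 0) {
  if (maxLen < 1) throw new RangeError("maxLen must be at least 1");
  if (text.length <= maxLen) return [text];

  const sep = SEPARATORS[level];
  const chunks = [];
  let current = "";

  for (const piece of text.split(sep)) {
    if (piece.length === 0) continue;
    // Greedily pack neighbouring pieces back together while they still fit.
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxLen) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    current = "";
    if (piece.length <= maxLen) {
      current = piece;
    } else {
      // The piece alone is too long: fall through to the next finer boundary.
      chunks.push(...splitRecursively(piece, maxLen, level + 1));
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Calling `splitRecursively(text, 16)` on a two-paragraph string first splits at the blank line, then at sentence ends for any paragraph that is still too long, and only falls back to words or single characters when nothing coarser fits.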

I really like the idea of automating these complex things to make it easier for laypeople. Maybe it would be nice to have an "auto-mode" and "advanced settings" where you can fine-tune all the parameters.

varunneal commented 1 year ago

I might end up using an approach similar to RecursiveCharacterTextSplitter, but the broader point of Semantic Segmentation is to specifically highlight/identify semantically relevant phrases. As an example, the current implementation selects:

So Hansel and Grethel sat by the fire, and at noon they each ate their pieces of bread. They thought their father was in the wood all the time,

Splitting the text by sentences might give us

So Hansel and Grethel sat by the fire, and at noon they each ate their pieces of bread. They thought their father was in the wood all the time,

which is an improvement. Splitting by commas, e.g. recursively, could get us the desired

So Hansel and Grethel sat by the fire, and at noon they each ate their pieces of bread. They thought their father was in the wood all the time,

The basic algorithm I'm considering is:

1. coarsely segment the text at a sentence/paragraph level
2. recursively segment individual paragraphs/sentences into clauses (along commas, semicolons, other punctuation)
3. combine contiguous groups of segments into a single "highlight"
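A rough sketch of these three steps in plain JavaScript, under several assumptions: the function names (`splitSentences`, `splitClauses`, `findHighlights`) are hypothetical, the punctuation rules are deliberately naive, clause splitting is a single pass rather than truly recursive, and `score` stands in for whatever relevance measure is used (in the app this would presumably be an embedding similarity against the query, which would be asynchronous).

```javascript
// Step 1 (naive): a sentence is a run of text up to ., ! or ?.
function splitSentences(text) {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Step 2 (naive, single pass): break a sentence into clauses at , ; or :
// keeping the punctuation attached to the clause it ends.
function splitClauses(sentence) {
  const parts = sentence.match(/[^,;:]+[,;:]*/g) || [];
  return parts.map((c) => c.trim()).filter((c) => c.length > 0);
}

// Step 3: merge contiguous clauses that score at or above `threshold`
// into one highlight. `score` is a caller-supplied function (assumption).
function findHighlights(text, score, threshold) {
  const clauses = splitSentences(text).flatMap(splitClauses);
  const highlights = [];
  let run = [];
  for (const clause of clauses) {
    if (score(clause) >= threshold) {
      run.push(clause);
    } else if (run.length > 0) {
      highlights.push(run.join(" "));
      run = [];
    }
  }
  if (run.length > 0) highlights.push(run.join(" "));
  return highlights;
}
```

With a toy scorer such as `(c) => (/fire|bread/.test(c) ? 1 : 0)` and a threshold of 0.5, the two clauses of the first Hansel and Grethel sentence are both relevant and adjacent, so they are merged into a single highlight, while the second sentence is left out.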

do-me / SemanticFinder

Better navigation, stop/pause button, updated transformers.js, threshold input. #4