Darazaki / Spedread

GTK speed reading software: Read like a speedrunner!
GNU General Public License v3.0

Word segmentation support #16

Open GrimPixel opened 1 year ago

GrimPixel commented 1 year ago

There are languages other than Japanese that need word segmentation: https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Text-Processing-Tools#Word_Segmentation

Darazaki commented 1 year ago

Hi, and sorry for the wait. This looks like a great resource, thanks! I really underestimated how big a task segmenting words would be.

I was hoping what I suggested in #5 would suffice, but now the better approach seems to be to completely rework the way Spedread reads words.

Maybe something like:

when start_reading_button.pressed:
    chunks = user_text.split_by_language()

    for language, text_chunk in chunks:
        if language.requires_word_segmentation:
            words = language.get_nlp_library().parse(text_chunk)
        else:
            words = text_chunk.split_by_spaces()

What do you think?

Next week I'll also ask one of my colleagues who does NLP stuff whether that's reasonable.
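
For illustration only, the pseudocode above could be sketched in Python roughly like this. It is not Spedread code: language detection is approximated with a simple Unicode-range check, and `naive_cjk_segment`, `SEGMENTERS`, `split_by_script` and `words_for_reading` are all hypothetical names. The per-character segmenter is just a placeholder for whatever real NLP library would be plugged in.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple


def naive_cjk_segment(text: str) -> List[str]:
    """Placeholder segmenter: one "word" per character.

    A real engine (an NLP library) would go here instead.
    """
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry mapping a script name to its segmenter.
SEGMENTERS: Dict[str, Callable[[str], List[str]]] = {
    "cjk": naive_cjk_segment,
}

# Assumption: Hiragana, Katakana and CJK Unified Ideographs stand in for
# "languages that require word segmentation".
_CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")


def split_by_script(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (script, chunk) pairs, alternating CJK and non-CJK runs."""
    pos = 0
    for match in _CJK_RUN.finditer(text):
        if match.start() > pos:
            yield "other", text[pos:match.start()]
        yield "cjk", match.group()
        pos = match.end()
    if pos < len(text):
        yield "other", text[pos:]


def words_for_reading(text: str) -> List[str]:
    """Segment chunks that need it; split everything else on whitespace."""
    words: List[str] = []
    for script, chunk in split_by_script(text):
        segmenter = SEGMENTERS.get(script)
        if segmenter is not None:
            words.extend(segmenter(chunk))
        else:
            words.extend(chunk.split())
    return words
```

With these definitions, `words_for_reading("Hello world 日本語 ok")` returns `["Hello", "world", "日", "本", "語", "ok"]`. Swapping in a real segmenter would only mean changing the entry in `SEGMENTERS`.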

GrimPixel commented 1 year ago

Great to hear that! I think users could choose their own word segmentation engine. Just place the engines in a folder and write a file that calls the chosen engine to segment the sentences.

Darazaki commented 1 year ago

Good idea! If I end up going with it, I'll figure out the best format for these libraries later (maybe .so/.wasm or Python scripts, idk).
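
As a rough sketch of the engines-in-a-folder idea, assuming the Python-script option were chosen: each script in the folder would expose a `segment(text)` function, and the application would load whatever it finds there. The `segment` convention and the `load_engines` helper are assumptions made up for this example, not an agreed format.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Dict, List

Segmenter = Callable[[str], List[str]]


def load_engines(folder: Path) -> Dict[str, Segmenter]:
    """Load every *.py file in `folder` that defines a callable `segment`.

    The engine name is the file name without its extension.
    Files without a `segment` callable are skipped.
    """
    engines: Dict[str, Segmenter] = {}
    for path in sorted(folder.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the user's script
        segment = getattr(module, "segment", None)
        if callable(segment):
            engines[path.stem] = segment
    return engines
```

Loading scripts this way executes arbitrary user code, so it only makes sense for engines the user installed themselves. A .so or .wasm format would need a different loader but could keep the same one-function contract.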