Conversion of existing text to use split words

AlefAlefAlef / ivrita

Ivrita is an open-source set of typographic tools for gender equality in Hebrew

https://alefalefalef.co.il/ivrita

GNU Affero General Public License v3.0

Conversion of existing text to use split words #4

Open idow09 opened 3 years ago

idow09 commented 3 years ago

Hi! Love the project!! 🤩 Are there any plans or thoughts about implementing a mechanism to handle (convert?) existing, old, non-neutral text? I'm a software engineer with a background in ML, and I'm curious whether you think this is worth a shot. If so, I would love to learn from your experience and expertise in the field so I don't go wasting my time 😅 Thanks

avrahamcornfeld commented 3 years ago

Hey Ido, thanks for the input. This would be an amazing feature, but I don't think it is possible, since the code doesn't understand the context of the words. For example, the word פתח can have many meanings, some of which need to be genderized while others don't. Here are some random examples:

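אמרתי לאחי ״פתח את הדלת״
("I told my brother, 'open the door'": פתח as an imperative verb addressed to a man.)
אורח/ת יקר/ה, הכנס/י דרך פתח הבניין, עלה/י במדרגות ופתח/י את הדלת
("Dear guest, enter through the building's entrance, go up the stairs and open the door", written with split masculine/feminine forms: פתח first as the noun "entrance", then as a verb.)
את האות אל״ף יש לנקד בסימן פתח
("The letter alef should be vocalized with the patach sign": פתח as the name of a vowel mark.)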

The only way this might work is with a very powerful AI script that knows how to analyze the text...
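To make the ambiguity concrete, here is a minimal Python sketch of a context-free rewriter run on the פתח examples above. It is not Ivrita's code: the rule table `NAIVE_RULES` and the function `naive_neutralize` are hypothetical names made up for illustration, and the sketch assumes a "word" is simply a run of Hebrew letters.

```python
import re

# Hypothetical one-entry rule table: masculine imperative -> split form.
NAIVE_RULES = {"פתח": "פתח/י"}

# Assumption: a run of Hebrew letters (alef through tav) counts as one word.
HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+")


def naive_neutralize(text: str) -> str:
    """Rewrite every word found in NAIVE_RULES, ignoring its context."""
    return HEBREW_WORD.sub(
        lambda m: NAIVE_RULES.get(m.group(0), m.group(0)), text
    )


examples = [
    "אמרתי לאחי ״פתח את הדלת״",        # verb, but addressed to a specific man
    "הכנס/י דרך פתח הבניין",           # noun: "entrance"
    "את האות אל״ף יש לנקד בסימן פתח",  # name of a vowel mark
]

for sentence in examples:
    print(naive_neutralize(sentence))
```

In all three sentences the standalone פתח is rewritten even though none of them calls for it, while a form such as ופתח (with the prefix ו), the one verb in the examples that does need splitting, is not matched at all.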

idow09 commented 3 years ago

I realize the challenge, of course. If it were an easy task, I'm sure you would already have done it. But I think some research should be done on current SOTA Hebrew NLP models before concluding that it's not possible, don't you think? Think about it: even if the success rate of such a model is not 100% (it never is...), one could use such a tool to get an excellent head start, and then fix the errors manually. Existing NLP models are very powerful at understanding grammar and extracting meaning from context... Let me know what you think, thanks!
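The "head start, then fix manually" workflow could look roughly like the sketch below. It assumes some model exposes a per-token suggestion with a confidence score. No real Hebrew NLP library is used here: `Scorer`, `toy_scorer`, `propose`, and the 0.9 threshold are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Suggestion:
    index: int          # position of the token in the sentence
    original: str
    replacement: str
    confidence: float


# Hypothetical model interface: (tokens, position) -> (replacement, confidence).
Scorer = Callable[[List[str], int], Tuple[str, float]]


def propose(
    tokens: List[str], scorer: Scorer, threshold: float = 0.9
) -> Tuple[List[str], List[Suggestion]]:
    """Apply confident rewrites; queue uncertain ones for a human editor."""
    accepted = list(tokens)
    review: List[Suggestion] = []
    for i, token in enumerate(tokens):
        replacement, confidence = scorer(tokens, i)
        if replacement == token:
            continue  # the model proposes no change for this token
        if confidence >= threshold:
            accepted[i] = replacement
        else:
            review.append(Suggestion(i, token, replacement, confidence))
    return accepted, review


def toy_scorer(tokens: List[str], i: int) -> Tuple[str, float]:
    """Stand-in for a real model: confident only when the next word is את."""
    token = tokens[i]
    if token != "פתח":
        return token, 1.0
    next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
    return "פתח/י", 0.95 if next_token == "את" else 0.4


accepted, review = propose("דרך פתח הבניין".split(), toy_scorer)
print(accepted, review)
```

The threshold sets the trade-off: raising it sends more tokens to the manual review queue, lowering it applies more rewrites automatically.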

kinging123 commented 3 years ago

Hi @idow09, your idea sounds pretty cool and challenging.

Currently, Ivrita runs on the client side of the website, so I assume NLP/machine learning is not relevant to the current implementation. As I understand it, in order to use the models you're referring to, Ivrita would need to run on a centralized server, to which all the texts would be sent over an API. This means that (to save loading time on each page of the website) all of the strings would need to be parsed once when a page is published, and not at runtime as they are parsed currently.
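The publish-time idea can be sketched as a small cache: each string is converted once when a page is published, and page loads only read the stored result. This is an illustration under assumptions, not a design for Ivrita. `PublishTimeConverter` is a hypothetical name, and the `convert` callable stands in for the call to a centralized server.

```python
import hashlib
from typing import Callable, Dict, Iterable


class PublishTimeConverter:
    """Convert strings once at publish time and serve cached results later."""

    def __init__(self, convert: Callable[[str], str]) -> None:
        # `convert` is a placeholder for a (possibly slow) server-side model call.
        self._convert = convert
        self._cache: Dict[str, str] = {}
        self.calls = 0  # number of times the expensive conversion actually ran

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def publish(self, strings: Iterable[str]) -> None:
        """Run the conversion for each string that is not cached yet."""
        for text in strings:
            key = self._key(text)
            if key not in self._cache:
                self._cache[key] = self._convert(text)
                self.calls += 1

    def render(self, text: str) -> str:
        """Page-load path: a dictionary lookup only, no model call."""
        return self._cache.get(self._key(text), text)


# Usage with a trivial stand-in conversion.
converter = PublishTimeConverter(lambda s: s.upper())
converter.publish(["hello", "world", "hello"])
print(converter.render("hello"), converter.calls)
```

Keying the cache by a hash of the text means a repeated string is converted only once, and a string that was never published falls back to its original form.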

Although this requires a lot of work and would be used slightly differently than the current product we offer, it does sound very exciting, and we know a lot of people who really need it! If you have the passion to make this (or even just begin the work on it) - we will be glad to see and support it!