Work in progress in readding polyglot support

LuisMayo commented 1 year ago

To merge this, we need polyglot to actually detect sentiment. It currently crashes with a division by zero, but the underlying cause is that every word's sentiment is being detected as 0.

I can't download the models either (possible problem) since I get a timeout :(

Meorge commented 1 year ago

Looking into this issue a bit more, it seems like it may be an issue on Polyglot's end - I think their method for determining sentiment might just not be particularly sophisticated or robust.

I tried the following script outside of the Objection Engine:

from polyglot.text import Text
comments = [
    "This is a good happy sentence! Yay haha woohoo!",
    "I hate you so much bad evil angry.",
    "This sentence is neutral",

    "¡Me encantan los perritos pequeños y te quiero mucho!",
    "Espero que tienes dolor y muertas.",
    "Este oración es neutral."
]

for i in comments:
    try:
        polarity = Text(i).polarity
    except ZeroDivisionError:
        polarity = 0
    print(f'"{i}": {polarity}')

The results were:

"This is a good happy sentence! Yay haha woohoo!": 1.0
"I hate you so much bad evil angry.": -1.0
"This sentence is neutral": 0
"¡Me encantan los perritos pequeños y te quiero mucho!": -1.0
"Espero que tienes dolor y muertas.": -1.0
"Este oración es neutral.": 0

Overall, it didn't seem to do poorly here, although the Spanish sentence intended to be positive received a negative score. The division-by-zero error is definitely a problem with how the Polyglot code calculates the overall score of a sentence.
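A minimal sketch of the kind of guard that would avoid the crash. It assumes the sentence score is the average of the per-word scores that are non-zero, which is consistent with the outputs above but has not been checked against Polyglot's source. `sentence_polarity` is a hypothetical helper, not part of Polyglot.

```python
def sentence_polarity(word_polarities):
    """Average the non-zero word scores; return 0.0 when there are none."""
    # Assumption: words scored 0 are ignored when averaging.
    scores = [p for p in word_polarities if p != 0]
    if not scores:
        # No word carried any sentiment, so treat the sentence as neutral
        # instead of dividing by zero.
        return 0.0
    return sum(scores) / len(scores)


print(sentence_polarity([1, 0, 1, 1]))     # 1.0
print(sentence_polarity([0, 0, 0]))        # 0.0
print(sentence_polarity([1, -1, -1, -1]))  # -0.5
```

This has the same effect as catching ZeroDivisionError in the script above, but handles the empty case explicitly.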

I'm unfortunately not sure how it would be best to proceed with this - it seems like it'd require a lot of work on the Polyglot side. Depending on how the bots are using the Hugging Face library, there might be ways to modify them so that they use it more efficiently and don't take up as much RAM.

LuisMayo commented 1 year ago

Using polyglot is one possible solution, not a requirement in itself. If we can find another way for the bots to use less RAM, that would be acceptable as well.

Meorge commented 1 year ago

Is the current bot code public? I found this from a few months ago: https://github.com/LuisMayo/Objection-Engine-Rabbit-Worker/blob/master/main.py#L11

Each call to render_comment_list() will create a new instance of the Hugging Face model, which is probably where the RAM usage happens. Since the model shouldn't be changing across render requests, we can instead create one analyzer or DialogueBoxBuilder when the bot starts up, and then reuse it (pass it new arguments to render()) each time we want to render a new video.

Here's some pseudocode based off of my fork of this branch:

from os import getenv

# Only run this stuff once on bot startup, so that
# these parts are not being recreated and garbage collected
# every time a new thread is rendered
# The base analyzer is kept as-is when oe_bypass_sentiment is set
sentiment_analyzer = SentimentAnalyzer()
if len(getenv("oe_bypass_sentiment", "")) <= 0:
    model_setting = getenv("oe_sentiment_model", "hf")
    # Nested under the bypass check so model_setting is always defined here
    if model_setting == "hf":
        sentiment_analyzer = HuggingFaceAnalyzer()
    elif model_setting == "pg":
        sentiment_analyzer = PolyglotAnalyzer()

builder = DialogueBoxBuilder(
    callbacks=callbacks, sentiment_analyzer=sentiment_analyzer
)

# Call this function every time we want to render a thread
# (note that the input arguments are incomplete for brevity's sake)
def render_comment_list(comment_list, output_filename, ...):
    builder.render(
        comment_list,
        output_filename=output_filename,
        music_code=music_code.lower(),
        assigned_characters=assigned_characters,
        adult_mode=adult_mode,
        avoid_spoiler_sprites=avoid_spoiler_sprites,
        resolution_scale=resolution_scale,
    )
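The create-once, reuse-everywhere pattern can also be shown in a self-contained form. This is only a sketch: `StubAnalyzer`, `get_analyzer`, and `render_request` are hypothetical stand-ins, not Objection Engine classes, and the stub does no real model loading.

```python
from functools import lru_cache


class StubAnalyzer:
    """Hypothetical stand-in for an expensive model wrapper."""

    instances_created = 0

    def __init__(self):
        # A real analyzer would load its model weights here.
        StubAnalyzer.instances_created += 1

    def score(self, text):
        # Dummy neutral score; a real analyzer would run the model.
        return 0.0


@lru_cache(maxsize=1)
def get_analyzer():
    """Create the analyzer on first use and return the same object afterwards."""
    return StubAnalyzer()


def render_request(comments):
    # Every request shares the single cached analyzer.
    analyzer = get_analyzer()
    return [analyzer.score(comment) for comment in comments]


for _ in range(10):
    render_request(["first comment", "second comment"])

print(StubAnalyzer.instances_created)  # 1
```

Ten render requests construct the analyzer only once, which is the behavior the pseudocode above aims for.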

LuisMayo commented 1 year ago

The bot code is public.

I like your idea, but we still have the problem that a single instance of the model already takes a lot of RAM. Your idea would greatly improve renders per unit of time, though.

Maybe when the Twitter bot closes, this will be alleviated and stop being a problem.

Meorge commented 1 year ago

How much RAM does the server have available/what would be an acceptable level of RAM usage?

I did a quick test where I rendered the same video 10 times in a row with the same sentiment analyzer, for both Hugging Face and Polyglot.

Polyglot used 0.43 GB of RAM
Hugging Face used 2.45 GB of RAM

So the Hugging Face model definitely uses a lot more than Polyglot. We might be able to find a different model on Hugging Face that is smaller but still works, but it doesn't look like there are many options: https://huggingface.co/models?language=en,fr,es&sort=downloads&search=sentiment

The kinda-good news here is that I got identical results when only rendering the video once, so I'm more confident now that memory usage will not scale with the number of videos rendered, as long as the API is used correctly.
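One way to check the "does not scale with the number of renders" claim is to sample memory after each run and look for growth. The sketch below uses the standard-library tracemalloc module with two hypothetical stand-in tasks. Note the assumption: tracemalloc only sees allocations made through Python's allocators, so memory a model library allocates natively would need a process-level measurement (resident set size) instead.

```python
import tracemalloc


def heap_after_each_run(task, runs):
    """Return the traced Python heap size in bytes after each call to task()."""
    tracemalloc.start()
    try:
        samples = []
        for _ in range(runs):
            task()
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
        return samples
    finally:
        tracemalloc.stop()


retained = []


def steady_task():
    # Allocates about 1 MB that is released when the function returns.
    buffer = bytearray(1_000_000)
    return len(buffer)


def leaky_task():
    # Keeps about 1 MB alive per call, like rebuilding and retaining state.
    retained.append(bytearray(1_000_000))


steady = heap_after_each_run(steady_task, 10)
leaky = heap_after_each_run(leaky_task, 10)
print("steady growth (bytes):", steady[-1] - steady[0])
print("leaky growth (bytes):", leaky[-1] - leaky[0])
```

A flat series across runs is what the reuse approach should produce; a series that climbs with every run would point to something being recreated or retained per render.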

LuisMayo / objection_engine

Work in progress in readding polyglot support #107