LuisMayo closed this 11 months ago
Looking into this issue a bit more, it seems like it may be an issue on Polyglot's end - I think their method for determining sentiment might just not be particularly sophisticated or robust.
I tried the following script outside of the Objection Engine:
from polyglot.text import Text

comments = [
    "This is a good happy sentence! Yay haha woohoo!",
    "I hate you so much bad evil angry.",
    "This sentence is neutral",
    "¡Me encantan los perritos pequeños y te quiero mucho!",
    "Espero que tienes dolor y muertas.",
    "Este oración es neutral."
]

for i in comments:
    try:
        polarity = Text(i).polarity
    except ZeroDivisionError:
        # Polyglot raises this for some sentences; treat them as neutral
        polarity = 0
    print(f'"{i}": {polarity}')
The results were:
"This is a good happy sentence! Yay haha woohoo!": 1.0
"I hate you so much bad evil angry.": -1.0
"This sentence is neutral": 0
"¡Me encantan los perritos pequeños y te quiero mucho!": -1.0
"Espero que tienes dolor y muertas.": -1.0
"Este oración es neutral.": 0
Overall, it didn't seem to do poorly here, although the Spanish sentence intended to be positive received a negative score. The division-by-zero error is definitely a problem with how the Polyglot code calculates the overall score of a sentence.
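The integer `0` printed for the two neutral sentences suggests they went through the `except` branch, since a value returned by Polyglot would more likely print as a float. One plausible explanation, which is an assumption and has not been checked against Polyglot's source, is that the sentence score is the average of the nonzero per-word scores, so a sentence with no scored words divides by zero. A minimal sketch of that calculation with a guard added (`mean_nonzero_polarity` is a hypothetical helper, not part of Polyglot):

```python
def mean_nonzero_polarity(word_scores, default=0.0):
    """Average the nonzero per-word scores, returning `default` if there are none.

    Assumption: this mirrors how the sentence-level score is derived.
    It is an illustration only, not Polyglot's actual implementation.
    """
    nonzero = [score for score in word_scores if score != 0]
    if not nonzero:
        # No word carried any sentiment: avoid dividing by zero
        return default
    return sum(nonzero) / len(nonzero)
```

With a guard like this, a sentence whose words all score 0 comes back as neutral instead of raising, which is the same behavior the `try`/`except` in the script above works around.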
I'm unfortunately not sure how it would be best to proceed with this - it seems like it'd require a lot of work on the Polyglot side. Depending on how the bots are using the Hugging Face library, there might be ways to modify them so that they use it more efficiently and don't take up as much RAM.
Using Polyglot is one possible solution, not a requirement in itself. If we can find another way for the bots to use less RAM, that would be acceptable as well.
Is the current bot code public? I found this from a few months ago: https://github.com/LuisMayo/Objection-Engine-Rabbit-Worker/blob/master/main.py#L11
Each call to render_comment_list() will create a new instance of the Hugging Face model, which is probably where the RAM usage happens. Since the model shouldn't be changing across render requests, we can instead create one analyzer or DialogueBoxBuilder when the bot starts up, and then reuse it (pass it new arguments to render()) each time we want to render a new video.
Here's some pseudocode based off of my fork of this branch:
# Only run this stuff once on bot startup, so that
# these parts are not being recreated and garbage collected
# every time a new thread is rendered
sentiment_analyzer = SentimentAnalyzer()
if len(getenv("oe_bypass_sentiment", "")) <= 0:
    model_setting = getenv("oe_sentiment_model", "hf")
    if model_setting == "hf":
        sentiment_analyzer = HuggingFaceAnalyzer()
    elif model_setting == "pg":
        sentiment_analyzer = PolyglotAnalyzer()

builder = DialogueBoxBuilder(
    callbacks=callbacks, sentiment_analyzer=sentiment_analyzer
)

# Call this function every time we want to render a thread
# (note that the input arguments are incomplete for brevity's sake)
def render_comment_list(comment_list, output_filename, ...):
    builder.render(
        comment_list,
        output_filename=output_filename,
        music_code=music_code.lower(),
        assigned_characters=assigned_characters,
        adult_mode=adult_mode,
        avoid_spoiler_sprites=avoid_spoiler_sprites,
        resolution_scale=resolution_scale,
    )
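As a runnable illustration of the construct-once idea, here is a minimal sketch that uses `functools.lru_cache` to build the analyzer lazily on first use and hand back the same object afterwards. This is an alternative to creating it at module level, not what the fork does. `StubAnalyzer` and `get_analyzer` are hypothetical stand-ins; only the `oe_sentiment_model` environment variable comes from the pseudocode above.

```python
from functools import lru_cache
from os import getenv


class StubAnalyzer:
    """Hypothetical stand-in for an expensive, model-backed analyzer."""

    instances_created = 0  # counts constructions so reuse can be checked

    def __init__(self, model_setting):
        StubAnalyzer.instances_created += 1
        self.model_setting = model_setting


@lru_cache(maxsize=None)
def get_analyzer():
    """Build the analyzer on the first call, then return the cached object."""
    return StubAnalyzer(getenv("oe_sentiment_model", "hf"))


def render_comment_list(comment_list):
    """Simplified render entry point: every call shares one analyzer."""
    analyzer = get_analyzer()
    return [(comment, analyzer.model_setting) for comment in comment_list]
```

Calling `render_comment_list()` any number of times leaves `StubAnalyzer.instances_created` at 1, which is the property that matters for keeping the model from being reloaded on each render.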
The bot code is public.
I like your idea, but we still have the problem that a single instance of the model already takes a lot of RAM. Your idea would greatly improve the number of renders per unit of time, though.
Maybe once the Twitter bot shuts down, this will be alleviated and stop being a problem.
How much RAM does the server have available/what would be an acceptable level of RAM usage?
I did a quick test where I rendered the same video 10 times in a row with the same sentiment analyzer, for both Hugging Face and Polyglot.
So the Hugging Face model definitely uses a lot more RAM than Polyglot. We might be able to find a different model on Hugging Face that is smaller but still works, but it doesn't look like there are many options: https://huggingface.co/models?language=en,fr,es&sort=downloads&search=sentiment
The kinda-good news here is that I got identical results when only rendering the video once, so I'm more confident now that memory usage will not scale with the number of videos rendered, as long as the API is used correctly.
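For anyone who wants to repeat this kind of check, below is a rough sketch using the standard library's `tracemalloc` (it needs Python 3.9+ for `reset_peak`). It is not the script used for the test above. `peak_memory_per_run` and `fake_render` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries, such as model weights, may not show up. Process-level RSS would be the more faithful number for that.

```python
import tracemalloc


def peak_memory_per_run(render_once, runs=10):
    """Call `render_once` repeatedly and record the peak traced memory
    (in bytes) observed during each call."""
    peaks = []
    tracemalloc.start()
    try:
        for _ in range(runs):
            # Reset the peak so each run is measured on its own
            tracemalloc.reset_peak()
            render_once()
            _, peak = tracemalloc.get_traced_memory()
            peaks.append(peak)
    finally:
        tracemalloc.stop()
    return peaks


def fake_render():
    """Hypothetical stand-in for one video render: allocate, then release."""
    frames = [bytes(1024) for _ in range(1000)]
    return len(frames)


if __name__ == "__main__":
    for run, peak in enumerate(peak_memory_per_run(fake_render), start=1):
        print(f"run {run}: peak {peak / 1024:.0f} KiB")
```

If the per-run peaks stay roughly flat, memory is not accumulating across renders. A steady climb would point to something being recreated or retained on every call.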
In order to merge this, we need Polyglot to actually detect sentiment. It currently crashes with a division-by-zero error, but the underlying cause is that every word's sentiment is being detected as 0.
I can't download the models either (possible problem) since I get a timeout :(