Objection engine rewrite

Meorge commented 1 year ago

Since I started contributing to this project, there have been quite a few things I've wanted to add, but found difficult to do with the way the current video-building system is structured:

Add "Hold it!" and "Take that!" bubbles, and allow them to be inserted at any point in dialogue (#42)
Mid-dialogue pauses, sound effects, and animations for emphasis (#29)
Relatively easily make videos for other Ace Attorney games using different assets (#71)
Potential to add other shots, like jury murmurs, gavel banging, close-ups with action lines (#5)
Panning animation between the defense's bench, witness stand, and prosecutor's bench
Generally speaking, more customization/control over the dialogue, that can more closely rival the original games

I've been working on a "rough draft" of a rendering system rewrite. This is a sample export of what I have so far:

https://user-images.githubusercontent.com/9957987/201572929-8bc48f74-2047-4e73-90f1-a625d57e0808.mp4

In addition to the text colors, self-closing tags can also be used to do things like change a character sprite, play an animation, or wait a specific period of time. This gives the user the ability to make much more dynamic scenes! The downside is that because so much is controlled via the input script, the input text size becomes a lot bigger. For example, here's the text for the first dialogue box from Phoenix in the video:

<startblip male/><sprite left new_assets/character_sprites/phoenix/phoenix-normal-talk.gif/>I am going to <red>slam the desk</red><sprite left new_assets/character_sprites/phoenix/phoenix-normal-idle.gif/><stopblip/><wait 1/> <stopblip/><objection phoenix/><wait 0.8/> <phoenixslam/><wait 0.8/> I<startblip male/><sprite left new_assets/character_sprites/phoenix/phoenix-normal-talk.gif/> just did it<stopblip/><holdit phoenix/><wait 0.8/> <startblip male/>did you see that <green>was i cool</green><sprite left new_assets/character_sprites/phoenix/phoenix-normal-idle.gif/><stopblip/><showarrow/><wait 3/>?<hidearrow/><playsound pichoop/>

Most of these tags (such as startblip and stopblip) could be added to the text by an intermediate step between the user input and the actual rendering, so unless a user wanted to script out actions very specifically, they probably wouldn't need to worry about most of them. I also often define macros as Python string constants, such as SLAM_PHX = "<phoenixslam/><wait 0.8/>", and drop them into the script with f-strings.
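
As a rough illustration of the macro idea, here is a minimal sketch. The tag strings are taken from the sample script above; the helper `say` and the constants `TALK` and `HUSH` are hypothetical names made up for this example, not part of the renderer.

```python
# Macro constants. SLAM_PHX is from the post; TALK and HUSH are
# hypothetical names introduced only for this sketch.
SLAM_PHX = "<phoenixslam/><wait 0.8/>"
TALK = "<startblip male/>"
HUSH = "<stopblip/>"


def say(text: str) -> str:
    """Wrap text in blip tags, as an intermediate step might do automatically."""
    return f"{TALK}{text}{HUSH}"


# Macros are combined with ordinary f-string interpolation.
line = (
    f"{say('I am going to <red>slam the desk</red>')}"
    f"<wait 1/>{SLAM_PHX}"
    f"{say('I just did it')}"
)
print(line)
```

A caller that only has plain text would go through something like `say`, while a user who wants fine control can still write the tags by hand.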

If you think this sounds good, I can start a new branch of Objection! and work on a more polished version! There are some major things I still need to get figured out, including reworking the tag system to better support commands and determining how the flow of action will be kept consistent. As such, it probably would not be ready for use in Objection! for quite a while. If I were to go ahead with continuing its development, though, it'd make sense to note it for the other issues listed above. That way, no one would address them, only for their changes to be wiped out by the new rendering engine.

Please let me know if you have any thoughts/questions/concerns/etc! 😄

Edit: Here's a link to my repository. The code (especially ace_attorney_scene.py, where the Ace Attorney-specific stuff happens) is a bit of a mess, sorry 😅 https://github.com/Meorge/ObjectionRenderer2Rough

LuisMayo commented 1 year ago

Hi

This seems like a good idea. Inlining commands in text is a common approach. It's basically how HTML works. I know Pokemon games work the same way.

My main issues are two, but they're solvable (probably):

We have to make sure the parser works fine. It shouldn't crash, and it should handle edge cases.
The library may have to provide a way to escape/sanitize text in case the "intermediate step" (the caller) wants to prevent users from writing code.

Apart from that I think the system may be a good idea and will help with future functionalities of objection engine.

I'll go ahead and mark those issues... not sure how but I'll mark them :P

I'm also gonna pin this issue.

Please use this issue to raise any questions you run into.

Meorge commented 1 year ago

Great, thanks! I'll try to get started on the revamped parser later this week - I've got an idea on how it would work, but need to get some time to experiment with it and figure out the fine details.

Meorge commented 1 year ago

Got some good progress done today - actions are no longer tied to specific strings, so I don't have to do weird stuff with spaces anymore!

https://user-images.githubusercontent.com/9957987/202365153-7a82f80d-1d90-46b6-976d-170854ec294e.mp4

I still need to get it working with multiple pages/boxes of dialogue.

Meorge commented 1 year ago

I'm continuing to make slow but steady progress. We now have multiple pages, the prosecutor's bench, and some more effects (shake and flash)!

https://user-images.githubusercontent.com/9957987/202835142-74490a47-4b8a-4b18-9455-3adc6f045dc4.mp4

I've been working on it in my separate repository up to now, but within the next few days I'm planning to move the code to my fork + branch of Objection so that more of its progress will be logged in the repo's commit history.

As for the issues you brought up, I definitely need to make sure that string-to-number conversions don't crash the program, although I think it may be good for them to print a warning to the user and/or emit some sort of signal so that a calling application knows there was an issue. I still need to figure out how to make tags escapable...
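
One way the "warn instead of crash" behavior for string-to-number conversions might look is sketched below. The function name `parse_number` and the fallback default are assumptions for illustration; the actual engine may signal problems differently.

```python
import warnings


def parse_number(raw: str, default: float = 0.0) -> float:
    """Convert a tag argument to a float, warning instead of raising.

    Hypothetical helper: a bad value such as "<wait abc/>" yields the
    default and emits a warning that a calling application can capture.
    """
    try:
        return float(raw)
    except ValueError:
        warnings.warn(f"Could not parse {raw!r} as a number; using {default}")
        return default


print(parse_number("0.8"))
print(parse_number("abc", default=1.0))
```

Because the standard `warnings` module is used, a calling application can record the warnings with `warnings.catch_warnings(record=True)` or turn them into errors if it prefers to fail fast.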

LuisMayo commented 1 year ago

If you don't know how to do something (like the tag escaping), you can open an MR and maybe I can take a look at it. I was thinking of just using a classic programming backslash \ to escape all reserved characters (<>/).

I don't think it would be strictly necessary for implementing it anyway
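
A minimal sketch of the backslash-escaping idea for callers that want to sanitize user text. The name `escape_text` is hypothetical, and including the backslash itself in the reserved set is an assumption added here so that the escaping can be undone unambiguously.

```python
# Reserved characters from the suggestion above (<, >, /), plus the backslash
# itself (an assumption of this sketch, so escaping stays reversible).
RESERVED = "<>/\\"


def escape_text(text: str) -> str:
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


print(escape_text("I <3 this / that"))
```

A caller would run untrusted text through this before handing it to the engine, so that anything resembling a tag is shown literally instead of being executed as a command.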

Meorge commented 1 year ago

For the tag-escaping: I'm currently omitting escaped tags via a negative lookbehind: (?<!\\)<(/?)(.+?)(/??)> This works, except the backslash is still visible in the displayed text. It may be that this is a spot where something other than regular expressions would work fine (or maybe even better).
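
One possible way to hide the leftover backslash is to unescape the plain-text segments after the tags have been located. The sketch below reuses the regular expression quoted above; the `tokenize` function, the token tuples, and the `ESCAPED` pattern are assumptions for illustration, not the code in the branch.

```python
import re

# The tag pattern quoted above: a tag must not be preceded by a backslash.
TAG = re.compile(r"(?<!\\)<(/?)(.+?)(/??)>")
# Hypothetical companion pattern: a backslash followed by a reserved character.
ESCAPED = re.compile(r"\\([<>/\\])")


def tokenize(script: str) -> list[tuple[str, str]]:
    """Split a script into ("text", ...) and ("tag", ...) tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(script):
        if match.start() > pos:
            # Drop the escaping backslashes only in displayed text.
            text = ESCAPED.sub(r"\1", script[pos:match.start()])
            tokens.append(("text", text))
        tokens.append(("tag", match.group(2)))
        pos = match.end()
    if pos < len(script):
        tokens.append(("text", ESCAPED.sub(r"\1", script[pos:])))
    return tokens


print(tokenize(r"a \<b\> <wait 1/> c"))
```

The lookbehind decides what counts as a tag, and the second pass only affects what is displayed, so the two concerns stay separate. A hand-written scanner could do both in one pass if regular expressions become awkward.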

It's still got a ways to go before it's ready to replace the existing renderer, but the API/user function calls are starting to look more like what we're expecting. Here's an exported video and the current function calls for it:

https://user-images.githubusercontent.com/9957987/203684175-c1bdd68d-01fc-4395-b0a6-65cbbdc618dc.mp4

pages: list[DialoguePage] = []

pages.append(DialoguePage([DialogueAction("music start cross-moderato", 0)]))
pages.extend(get_boxes_with_pauses(
    user_name="Phoenix",
    character="phoenix",
    text="Hello it is I, Phoenix Wright. I am saying some lines of text."
))
pages.extend(get_boxes_with_pauses(
    user_name="Edgeworth",
    character="edgeworth",
    text="And I am the antagonist, Edgeworth. I am also saying some lines of text. Here is a third line, because I am very serious."
))
pages.extend(get_boxes_with_pauses(
    user_name="Gumshoe",
    character="gumshoe",
    text="Hey, pal! I'm on the witness stand! Ain't that cool? Woah ho ho look at me!"
))
pages.extend(get_boxes_with_pauses(
    user_name="Judge",
    character="judge",
    text="And I'm over here, on another screen! Isn't that just the neatest thing?"
))

pages.extend(get_boxes_with_pauses(
    user_name="Lotta",
    character="lotta",
    text="Don't forget about me, m'kay? I'm still here too."
))

director = AceAttorneyDirector()
director.set_current_pages(pages)
director.render_movie(-15)

Things that still need to be done for the renderer itself, especially issues visible in this video:

[x] Objects in wide court shot
- [x] Defense bench
- [x] Prosecutor bench
- [x] Witness stand
[x] Improve text wrapping
- [x] Words should wrap sooner than they do right now
- [x] Add a pause after a box ends, if one character's dialogue spans more than one box
[x] RTL support (just realized this rewrite doesn't have the RTL updates that my previous pull request added to the older renderer, haha 😅 )
[x] Presenting evidence

Once these are done, we'll be able to work more on the API. And once that's done, we can start working on nice little features, like:

[x] Pauses after the ends of sentences
[x] Analyze sentences' sentiment and change sprites mid-box based on them
[x] Analyze the sentiment of words within a sentence and add impact effects/text coloring
[ ] Action line close-up shots, judge gavel smash, courtroom mumbling, etc
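
As an illustration of the "pauses after the ends of sentences" item, one tag-based approach could insert a wait tag after sentence-ending punctuation. This is only a sketch: the function name and the 0.6-second duration are assumptions, and the wait tag syntax is borrowed from the sample script earlier in the thread.

```python
import re

# Sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"([.!?]+)(\s+)")


def add_sentence_pauses(text: str, seconds: float = 0.6) -> str:
    """Insert a <wait .../> tag after each sentence that is followed by more text."""
    return SENTENCE_END.sub(rf"\1<wait {seconds}/>\2", text)


print(add_sentence_pauses("Hey, pal! I'm on the witness stand! Ain't that cool?"))
```

The final sentence gets no pause because nothing follows it; the end of the box already acts as a break.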

My branch with this code is at https://github.com/Meorge/objection_engine/tree/v4. It uses a new assets folder at https://drive.google.com/drive/folders/1-2NV-wpXPovJXW3KaI34AXRZFsrqm1oM?usp=sharing. I may be making changes to the contents and structure of this folder, so until we have an official release it probably doesn't make sense to host it somewhere else. I'd be more than happy to submit a PR for the code if you would like, but I don't know if it's complete enough for that to make sense yet, haha.

Meorge commented 1 year ago

Now that school's out for the year, I have more time to spend working on this! 🥳 One somewhat-major change I have gone with is making the HTML-style tag system optional. Before, the tags and text were "compiled" into lists of command objects that stated in what sequence they should be evaluated. Converting from "raw" text (as we might have scraped from Discord or Twitter) would follow a few steps:

Figure out where to place the HTML-style tags in the raw text string, and place them there, creating a new string
Compile the string into the list of command objects
Run the engine on the list of objects

This change cuts out step 1 by converting the raw text directly into the list of command objects. Not only is it more efficient, but I think it should also make it easier/more intuitive to code! I didn't remove any of the code for the HTML-style tag functionality. So if a user wanted to place commands at specific points or color specific chunks of text, they could do that too.
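
To make the shortcut concrete, here is a minimal sketch of going straight from raw text to command objects without building a tagged string first. The `Command` class and `compile_raw_text` function are hypothetical stand-ins; the real engine's classes (such as DialogueAction) may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Command:
    """Hypothetical command object: a name plus an optional argument."""
    name: str
    argument: str = ""


def compile_raw_text(text: str, blip: str = "male") -> list[Command]:
    """Build the command list directly, skipping the intermediate tag string."""
    return [
        Command("startblip", blip),
        Command("text", text),
        Command("stopblip"),
    ]


for command in compile_raw_text("Hello it is I, Phoenix Wright."):
    print(command)
```

The tag parser can still produce the same kind of list from a hand-written script, so both entry points feed the same rendering step.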

Meorge commented 1 year ago

Evidence is now in! It also tries to detect proper nouns using Polyglot, and gives them a 50% chance of being colored red.
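
The 50% coloring rule could be sketched roughly as follows. Proper-noun detection is left out: the set of proper nouns is passed in as an argument (in the engine it would come from Polyglot), and the function name and punctuation handling are assumptions for this example.

```python
import random


def color_proper_nouns(text, proper_nouns, rng, chance=0.5):
    """Wrap each detected proper noun in <red> tags with the given probability.

    proper_nouns is assumed to be a set produced by some detector;
    rng is a random.Random instance so results can be made repeatable.
    """
    words = []
    for word in text.split(" "):
        bare = word.strip(".,!?")  # ignore trailing punctuation when matching
        if bare in proper_nouns and rng.random() < chance:
            word = word.replace(bare, f"<red>{bare}</red>", 1)
        words.append(word)
    return " ".join(words)


rng = random.Random()
print(color_proper_nouns("Hello it is I, Phoenix Wright.", {"Phoenix", "Wright"}, rng))
```

Passing the random generator in explicitly makes it easy to seed it in tests or to force the behavior with `chance=1.0`.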

https://user-images.githubusercontent.com/9957987/208227044-45a53326-10cc-4740-ad88-0f0232c6065b.mp4

I also used the rich library to spruce up the terminal output:

https://user-images.githubusercontent.com/9957987/208227009-7a471b02-d86d-4b79-8b82-56c27f210672.mov

At this point, I don't think there's a whole lot more that'll be necessary before submitting the pull request. I'll see if I can get the sentence wrapping to work a bit better, and then I think it's mostly just moving the rest of the character sprites over to the new system.

LuisMayo commented 1 year ago

The work you're doing here is gigantic, Meorge. Things look so much better and that's using only your PoC.

The terminal output is also perfect

To be honest, I even feel a bit bad that you're doing so much work without help. Thanks a lot. I'm sure users will appreciate it once it reaches prod.

Meorge commented 1 year ago

Thanks so much! 😄

I'm having fun working on it, so no worries. I suppose, if anything, my main concern is how readable it will be for other maintainers. One of the main goals with this rewrite was to make it more modular and extensible, so that in the future others could add new features and fixes more easily than with the current system. I've got a pretty good understanding of how this new system works, but I think it'll be good for me to thoroughly document the architecture and interface so that others understand how to work with it.

Beyond that, I'd also like to see what can be optimized before publishing the v4 engine. As can be seen in the terminal video there, it took about 56 seconds to render a 50-second video. I'll have to compare it to the current engine to be sure, but I'm pretty sure that's a lot more time than the current engine would take.

Meorge commented 1 year ago

At this point, I'm feeling like it is quite close to ready for an alpha/beta release! I've written some documentation on the basic structure of the input, as well as tutorials for creating custom characters and music packs.

A few things I think would be good to have before the alpha/beta release:

[ ] Some kind of probability curve for determining when to pan the camera instead of cutting. Currently the camera will always pan between the left, center, and right sides of the courtroom, causing a lot of dead time. I'm thinking it should start with the camera pan probability at 1.0 - then, when the camera pans, it gets set to a low value (say, 0.1) and then slowly creeps back up towards 1.0 (perhaps more quickly, the longer the camera stays fixed in one position).
[ ] Better/more flexible sentiment analysis. Polyglot is great for being able to recognize so many languages in theory, but so far I've been finding it doesn't do a very good job with analyzing the English text I've given it. I think it might be worth looking into other models, such as some found on Hugging Face specifically trained on "internet speak". If there's a way we can prioritize one but fall back to another, that would be best.
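
The pan-probability idea from the first item could be sketched like this. The class name, the recovery rate of 0.3, and the injectable random generator are assumptions; this version moves a fixed fraction of the remaining distance back toward 1.0 on each cut, and does not model the "recover faster the longer the camera stays fixed" variant.

```python
import random


class PanChance:
    """Tracks the probability of panning instead of cutting."""

    def __init__(self, floor=0.1, recovery=0.3, rng=None):
        self.floor = floor        # value right after a pan
        self.recovery = recovery  # fraction of the gap to 1.0 regained per cut
        self.value = 1.0          # start by always panning
        self.rng = rng or random.Random()

    def should_pan(self) -> bool:
        if self.rng.random() < self.value:
            self.value = self.floor  # just panned: make the next pan unlikely
            return True
        # Cut instead, and let the probability creep back up toward 1.0.
        self.value += (1.0 - self.value) * self.recovery
        return False


chance = PanChance()
print([chance.should_pan() for _ in range(8)])
```

The director would call `should_pan()` each time the speaker changes sides, panning on True and cutting on False.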

LuisMayo commented 1 year ago

About performance: the current engine isn't exactly fast. In fact, I was thinking about rewriting parts of it in Rust, but it was too much work. So don't worry about it.

The probability curve seems like a good idea. Panning all the time is definitely not the best option, since it ends up being less impactful than just switching immediately.

If I find the time, I'll try the Hugging Face model. If it performs better than TextBlob, that'd be super. If it performs better than Polyglot, we could maybe even try LibreTranslate + Hugging Face and ditch Polyglot entirely. But this for sure needs further testing before making any decisions.

Meorge commented 1 year ago

Sounds good on the performance - I won't worry too much about it for now. We'll see how the engine performs, and we can make further optimizations if/when necessary.

Regarding the probability curve, do you think that might be a good thing to hold off on for a separate PR? It'd be easy enough to temporarily disable the panning code for now and just have it always cut instead, replicating the behavior of the v3 engine. Similarly, it might make sense to just use the current sentiment analyzers for the first release of v4, and then work on improved analysis once that's out there.

EDIT: Great news - After a little bit more investigating ~~(aka reading the document more closely)~~ I discovered that there's a multilingual sentiment analysis model available through Hugging Face as well! https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment

It is said to have been fine-tuned on eight languages (Arabic, English, French, German, Hindi, Italian, Portuguese, and Spanish), but the examples and paper suggest that it should support more languages! 😄 I tried implementing it in v4 and it seems to be working well. The model is around 1.1 GB, but once it's downloaded it should be able to detect sentiment very quickly.
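
The "prioritize one analyzer but fall back to another" idea mentioned earlier in the thread could be sketched as a simple chain. Everything here is hypothetical: each analyzer is assumed to be a callable that returns a label or None, and the wrappers around the Hugging Face model, TextBlob, or Polyglot are not shown.

```python
from typing import Callable, Optional, Sequence

# An analyzer takes text and returns a label, or None if it cannot decide.
Analyzer = Callable[[str], Optional[str]]


def first_sentiment(text: str, analyzers: Sequence[Analyzer], default: str = "neutral") -> str:
    """Try analyzers in priority order and return the first usable label."""
    for analyze in analyzers:
        try:
            label = analyze(text)
        except Exception:
            # A failing analyzer (missing model, unsupported language, ...)
            # just hands over to the next one in the list.
            continue
        if label is not None:
            return label
    return default


def broken_model(text: str) -> Optional[str]:
    raise RuntimeError("model not downloaded")


def keyword_fallback(text: str) -> Optional[str]:
    return "positive" if "cool" in text.lower() else None


print(first_sentiment("Ain't that cool?", [broken_model, keyword_fallback]))
```

With this shape, swapping in a different model later only means changing the order or contents of the list.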

LuisMayo / objection_engine

Objection engine rewrite #97