UBC-Thunderbots / Software

Robot Soccer Playing AI
http://www.ubcthunderbots.ca
GNU Lesser General Public License v3.0

Fixed Replay Issues #3204

Open Mr-Anyone opened 1 month ago

Mr-Anyone commented 1 month ago

Description

resolves #3154

I implemented essentially two features:

  1. To find the end time, I iterate in reverse over both the chunks and the log entries until I find the first valid one.
  2. To deal with field test replay files, I subtract the start timestamp from each timestamp, re-indexing all timestamps to start at 0 so that ProtoPlayer can play those replay files!
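The first fix can be sketched roughly like this (the data shapes and helper names are illustrative stand-ins, not the actual ProtoPlayer API):

```python
# Sketch of fix 1: walk the chunks and their log entries from the end
# until the first entry that validates, and use its timestamp as the
# replay end time. `is_valid_entry` and `get_timestamp` are hypothetical
# stand-ins for the real parsing/validation logic.
def find_end_time(chunks, is_valid_entry, get_timestamp):
    for chunk in reversed(chunks):
        for entry in reversed(chunk):
            if is_valid_entry(entry):
                return get_timestamp(entry)
    return None  # no valid entry anywhere in the replay
```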

Testing Done

Ran the following script. The two zip files come from the linked issues. Check out this branch and run the following commands.

wget https://github.com/UBC-Thunderbots/Software/files/14733831/proto_2024_03_23_230504.zip -P /tmp
wget https://github.com/UBC-Thunderbots/Software/files/14546417/broken_proto_logger.zip -P /tmp
unzip /tmp/proto_2024_03_23_230504.zip -d /tmp
unzip /tmp/broken_proto_logger.zip -d /tmp
./tbots.py run thunderscope  --blue_log /tmp/proto_2024_03_23_230504
./tbots.py run thunderscope  --blue_log /tmp/proto_2024_03_09_053131

Resolved Issues

resolves #3154

Length Justification and Key Files to Review

N/A

Review Checklist

It is the reviewer's responsibility to make sure every item here has been covered

Mr-Anyone commented 1 month ago

Background

I would like to explain some design decisions I made with regards to playing field test fixtures. Before I do so, please take a look at the following replay file for a field test:

This is the last 1000 lines

  1. https://pastebin.com/bD9RngKh

For context: the pastebin file is organized as timestamp||proto_class||message. The "====" marker indicates the start of a new message.
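Taking the "||" delimiter above literally (the real code uses the repo's REPLAY_METADATA_DELIMETER constant), unpacking one message could look like:

```python
# Hypothetical parser for the timestamp||proto_class||message layout
# described above; the "||" separator is illustrative only.
def parse_entry(line):
    timestamp, proto_class, message = line.rstrip("\n").split("||", maxsplit=2)
    return float(timestamp), proto_class, message
```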

Why did I filter out proto classes that do not have the timestamp field?

If you look closely at those lines, the timestamps (in this case 1711235119.596471) given in the replay file are all the same. Therefore, we cannot rely on that specific timestamp; we know for sure it is not right. I therefore eliminated all of those protobufs.

However, protobufs with the "epoch_timestamp_seconds" field seem to yield the correct time. To play those protobufs, we just re-index them so that the timestamp starts at 0 rather than some arbitrary Unix timestamp.
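The re-indexing step amounts to subtracting the start timestamp from every entry; a minimal sketch, assuming the timestamps arrive as a sorted list of floats:

```python
# Shift all timestamps so playback starts at t = 0 instead of an
# arbitrary Unix epoch timestamp. The input shape (a list of floats)
# is assumed for illustration.
def reindex_timestamps(timestamps):
    if not timestamps:
        return []
    start = timestamps[0]
    return [t - start for t in timestamps]
```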

nimazareian commented 1 month ago

Great finds! I tried the broken replay files again and they work well now, though I get the following error when running the second replay (note that the replay still runs fine despite this error):

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nima/.cache/bazel/_bazel_nima/d4d8c6a2a61c53de6e279fde073469b4/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 557, in __play_protobufs
    ) = ProtoPlayer.unpack_log_entry(
  File "/home/nima/.cache/bazel/_bazel_nima/d4d8c6a2a61c53de6e279fde073469b4/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 293, in unpack_log_entry
    proto = proto_class.FromString(base64.b64decode(data[len("b") : -len("\n")]))
google.protobuf.message.DecodeError: Error parsing message with type 'TbotsProto.World'

Quickly looking over your solution, I think it would work moving forward as well, but I think we should take a more proactive approach of fixing the root cause of the replay file lines being corrupt, rather than just fixing the corruption. From what you've mentioned, the problem for the field test replays is related to the timestamp being the same throughout the file. When checking what the epoch time is, I get datetime.fromtimestamp(1711235119.596471) -> datetime.datetime(2024, 3, 23, 16, 5, 19, 596471), which is the same day (though not the exact same time?) as when the first recording was taken. This likely shows that the ProtoLogger started logging with the initial time the field test had started at, and that its time was never updated. ProtoLogger uses a time_provider callback function to get the time to use for each protobuf line. The time provider for the field test fixture is defined here: https://github.com/UBC-Thunderbots/Software/blob/3266eb8bcad093d7a5ebd71cf0e1c3dbe924fd0b/src/software/field_tests/field_test_fixture.py#L101-L105

and the time value is updated based on the time of the latest vision packet: https://github.com/UBC-Thunderbots/Software/blob/3266eb8bcad093d7a5ebd71cf0e1c3dbe924fd0b/src/software/field_tests/field_test_fixture.py#L133-L138

Looking at the pastebin you've provided, it does seem like the vision packet time (t_capture) is being updated. This may indicate that self.ssl_wrapper_buffer isn't being properly updated with the latest wrapper packets, though the problem could be something else.

In any case, it may be better if we just update self.timestamp to be time.time() instead of t_capture (then we can probably remove ssl_wrapper_buffer as well). The main reason we have the time_provider callback function, as opposed to using time.time() directly in ProtoLogger, is to support simulated tests running as fast as possible depending on the computer they are running on (e.g. running faster in CI), while having the replay files run in realtime.
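A minimal sketch of that suggestion (the constructor shape is assumed, not the actual ProtoLogger signature):

```python
import time

# The logger asks its time_provider callback for the timestamp of each
# entry. Field tests run in realtime, so wall clock (time.time) works;
# simulated tests would pass a callback driven by simulation time instead.
class ProtoLoggerSketch:
    def __init__(self, time_provider):
        self.time_provider = time_provider

    def timestamp_for_entry(self):
        return self.time_provider()

field_test_logger = ProtoLoggerSketch(time_provider=time.time)
```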

cc @PhilD71 as he's more familiar with the field test fixture and may have additional context.


Another useful update would be to have better error handling in or around ProtoPlayer.unpack_log_entry: https://github.com/UBC-Thunderbots/Software/blob/1528173facd4b72d7fd4265e99da5ca9e26b1e3d/src/software/thunderscope/replay/proto_player.py#L128-L155

We should be able to detect if a log entry is corrupt (which could include the protobuf being corrupt so we're unable to deserialize it, or the timestamp being invalid) and skip over that log entry cleanly (and maybe just log that there was a corrupt line). Proper error handling may not be trivial given how ProtoPlayer is currently written, but I think it's the right step forward for making our replays robust to more general errors than just checking that the timestamp is above some value.
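One possible shape for that error handling, with unpack standing in for ProtoPlayer.unpack_log_entry (the names and structure are a sketch, not the actual code):

```python
import logging

# Try to unpack each log entry; skip (and log) any entry that raises,
# instead of letting the exception kill the playback thread.
def unpack_all_skipping_corrupt(entries, unpack):
    unpacked = []
    for i, entry in enumerate(entries):
        try:
            unpacked.append(unpack(entry))
        except Exception as err:
            logging.warning("Skipping corrupt log entry %d: %s", i, err)
    return unpacked
```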


Ideally, with the bug that produced broken replay files fixed and a more robust ProtoPlayer, we should be able to replay all future replays no matter where they were generated. That way, we shouldn't need specific code to check whether the replay file is from a field test, etc. Also, note that the goal of this ticket isn't just to get those replay files running, since we don't actually care about their content; the goal is to be more resilient to the same type of corruption that broke those files.

nimazareian commented 1 month ago

Also, don't forget to run ./formatting_script/fix_formatting.sh before you push

itsarune commented 1 month ago

I took a look at the broken field test fixture replay log and I'm not sure if it's incorrect. Looking at the timestamps, it is somewhat realistic to have multiple messages attached to the same timestamp. We get vision data every 0.5 seconds (30 Hz) and we expect the timestamp to update every 1 second or so. Our AI sends a ton of messages so it makes sense that a lot of messages would be attached to the same timestamp.

I think we need to get longer field_test logs to be able to really verify this correctly

nimazareian commented 1 month ago

I took a look at the broken field test fixture replay log and I'm not sure if it's incorrect. Looking at the timestamps, it is somewhat realistic to have multiple messages attached to the same timestamp. We get vision data every 0.5 seconds (30 Hz) and we expect the timestamp to update every 1 second or so. Our AI sends a ton of messages so it makes sense that a lot of messages would be attached to the same timestamp.

I think we need to get longer field_test logs to be able to really verify this correctly

I'm not sure if I fully follow your intuition here. We get vision at 60 Hz (I also confirmed this by looking at a second of the provided field test replay file, which had 62 SSL_WrapperPackets), which works out to one packet every ~0.016 seconds. I'm not sure why the timestamp would update every 1 second.

Though I wrote a script to read the replay file, similar to the one @Mr-Anyone likely used, and did notice that the timestamp is actually increasing as you mentioned, so the problem was not that. If you're curious, here's the script I used and the output I got. Vincent's pastebin wasn't actually the last 1000 lines of the replay; it was rather the last 1000 lines of the parsed replay, which included many newlines added when printing the protobufs.

The reason that field test files fail to parse appears to be that, since we rely on the time from vision, the first few saved logs may not have a time from vision because we haven't received an SSL_WrapperPacket yet. As a result, the time used for the first few lines is 0 (which is much less than the epoch time), so ProtoPlayer thinks the replay has a length of roughly the epoch time (~54 years?!). Every now and then in the middle of the file, the wrapper packet's time is also 0 for some odd reason (proto.detection.t_capture=0.0). Overall, I think these problems will go away if we start using wall time (time.time, or even better time.time_ns() for higher precision) instead of relying on the time from vision to organize our replays.

There was also a problem I ran into where the final line of the replay is empty, so we can't split it using our delimiter. Though that could be a bug in my script and how I'm reading the line. In either case, stuff like this is why I'm advocating for better exception handling in ProtoPlayer (e.g. wrapping the .split and proto_class.FromString calls with try-except, though the challenge may be updating the code to handle scenarios where ProtoPlayer.unpack_log_entry doesn't return a value due to an exception).

itsarune commented 1 month ago

You're right @nimazareian

I must've fried my brain yesterday after hiking. 60 Hz means a packet every 0.02 s. In that case, these changes make sense

itsarune commented 1 month ago

That script is fairly useful, I think, for debugging ProtoLogger issues. Maybe we should consider adding it to the repo

Mr-Anyone commented 1 month ago

New update:

  1. ProtoPlayer is still able to play when 10% of the log entries are corrupted. This covers both missing delimiters and actually corrupted data. See below for how the corrupted files are generated.
  2. I removed the source code that concerned invalid field test replays.

Here is a script that generates replay files in which 10% of the log entries are corrupted:

import random
import base64
from software.thunderscope.replay.replay_constants import *
from software.thunderscope.replay.proto_player import ProtoPlayer
import gzip
import os

def create_invalid_log_entries(proto, current_time, frequency=0.1):
    some_random_value_between_one_and_zero = random.random()

    # corrupt some entries by dropping the delimiter after the timestamp
    if some_random_value_between_one_and_zero < frequency / 2:
        serialized_proto = base64.b64encode(proto.SerializeToString())
        serialized_proto = base64.b64encode(proto.SerializeToString())
        log_entry = (
            f"{current_time}"
            + f"{proto.DESCRIPTOR.full_name}{REPLAY_METADATA_DELIMETER}"
            + f"{serialized_proto}\n"
        )
        return log_entry
    # corrupt some entries by appending garbage to the class name and payload
    elif some_random_value_between_one_and_zero < frequency:
        serialized_proto = base64.b64encode(proto.SerializeToString())
        serialized_proto = base64.b64encode(proto.SerializeToString())
        log_entry = (
            f"{current_time}{REPLAY_METADATA_DELIMETER}"
            + f"{proto.DESCRIPTOR.full_name}thisissomethingthatwouldbebad{REPLAY_METADATA_DELIMETER}"
            + f"{serialized_proto}thisisalsosomethingbad\n"
        )
        return log_entry

    # leave the remaining entries valid
    serialized_proto = base64.b64encode(proto.SerializeToString())
    log_entry = (
        f"{current_time}{REPLAY_METADATA_DELIMETER}"
        + f"{proto.DESCRIPTOR.full_name}{REPLAY_METADATA_DELIMETER}"
        + f"{serialized_proto}\n"
    )
    return log_entry

if __name__ == "__main__":
    path_to_recover = input("path to valid protobufs to corrupt: ")
    save_path = input("path to save: ")
    os.makedirs(save_path, exist_ok=True)

    sorted_chunks = sorted(os.listdir(path_to_recover))
    count = 0 
    for file in sorted_chunks:
        path_to_file = os.path.join(path_to_recover, file)
        print(f"Writing file: {path_to_file}")
        chunks = ProtoPlayer.load_replay_chunk(path_to_file)

        with gzip.open(os.path.join(save_path, f"{count}.replay"), "wb") as log_file:
            for log_entries in chunks:
                timestamp, proto_class, message = ProtoPlayer.unpack_log_entry(log_entries)
                log_entries = create_invalid_log_entries(message, timestamp)
                log_file.write(bytes(log_entries, encoding='utf-8'))
            count += 1

This is an output where 10 percent of the entries are invalid.

10percentcorrupt.zip

Please note that the program above assumes all entries in the input replay file are valid before use.

Testing

Run thunderscope with the zip file attached from above.

nimazareian commented 1 month ago

That script is fairly useful, I think, for debugging ProtoLogger issues. Maybe we should consider adding it to the repo

That's a good idea. @Mr-Anyone I've cleaned up the script and put it in this branch. Could you include it in your PR: https://github.com/nimazareian/Software/commit/17bd68131cd3402d50ee32665fb525e03be44fc8 Note that for some reason I ran into a weird error when trying to format this script using fix_formatting.sh, which may cause issues in CI, though it might've just been an error on my end.


It's great that you were able to test with a corrupted replay file and make the proto player more robust! I'm not sure how difficult it is, but if it's straightforward, could you add a pytest for ProtoPlayer where it opens corrupted replay files and checks that the code doesn't throw any exception? It'd be great to have some sort of test for it in CI so we can make sure we don't break this in the future (if it's not too much work and doesn't require a large refactor of ProtoPlayer). If you do end up doing this, there could probably be multiple tests, with one or more for each type of corruption (the replay files could be created within the test using your script).

Mr-Anyone commented 1 month ago

I am not quite sure how I am going to add a pytest to ensure ProtoPlayer doesn't throw an exception without running it. Most of the time, the exception comes from the following lines:

https://github.com/Mr-Anyone/Thunderbot_Software/blob/e3a58b898fbec0cf6bff0f657b1f1f63dc6d2d03/src/software/thunderscope/replay/proto_player.py#L477-L489

Since almost all exceptions inherit from the Exception class, the following lines would handle almost every error: ValueError, IndexError, IOError, etc.
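A quick check of that claim (KeyboardInterrupt and SystemExit are the notable exceptions, since they derive from BaseException directly):

```python
# ValueError, IndexError, and IOError all derive from Exception, so a
# single `except Exception` clause catches them.
for exc in (ValueError, IndexError, IOError):
    assert issubclass(exc, Exception)

# ...but not KeyboardInterrupt, which derives from BaseException.
assert not issubclass(KeyboardInterrupt, Exception)
```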

itsarune commented 1 month ago

You can add a new class member to ProtoPlayer called self.has_parsed_successfully, which defaults to True.

It gets set to False if we catch an error in the try-except block.

Then, add a getter ProtoPlayer.has_parsed_successfully that returns self.has_parsed_successfully, and assert on this getter in the pytest.
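That pattern could look roughly like this (names are hypothetical, not the actual ProtoPlayer implementation):

```python
# Record whether any log entry failed to parse, and expose it through a
# getter so a pytest can assert on it after playing a replay.
class PlayerSketch:
    def __init__(self):
        self._has_parsed_successfully = True

    def play_entry(self, unpack, entry):
        try:
            return unpack(entry)
        except Exception:
            # parsing failed: remember it, but don't crash playback
            self._has_parsed_successfully = False
            return None

    def has_parsed_successfully(self):
        return self._has_parsed_successfully
```

A pytest could then play a replay and assert on has_parsed_successfully() to verify whether corruption was encountered, without any exception escaping.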

Mr-Anyone commented 3 weeks ago

I've added a pytest. Do you want this test to be added in the CI?

nimazareian commented 3 weeks ago

I've added a pytest. Do you want this test to be added in the CI?

All tests added to the Bazel BUILD files run automatically. It seems like the test is actually failing in CI right now.

Mr-Anyone commented 3 weeks ago

That makes sense to be intended, though it is weird that we get them logged even for normal replays taken from AI vs AI. Here's the replay I got. Could you see why it's logging those messages? Maybe we're not importing a specific protobuf type so it can't parse it in the replay. proto_2024_06_13_203104.zip

I think there is actually file corruption going on. Here is the full backtrace I got when parsing your zip file:

Traceback (most recent call last):
  File "/home/vhe/.cache/bazel/_bazel_vhe/917dd48f905a75cc113dac63db2f3aa5/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 123, in is_log_entry_corrupt
    _ = ProtoPlayer.unpack_log_entry(log_entry)
  File "/home/vhe/.cache/bazel/_bazel_vhe/917dd48f905a75cc113dac63db2f3aa5/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 220, in unpack_log_entry
    proto = proto_class.FromString(base64.b64decode(data[len("b") : -len("\n")]))
  File "/usr/lib/python3.8/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (6373) cannot be 1 more than a multiple of 4

On my end, a newly created zip file works perfectly. Here is one if you are curious: working_proto.zip

nimazareian commented 3 weeks ago

I think there is actually a file corruption going on. Here is the full stackframe/backtrace I've got when parsing your zip file:

Have you unzipped the zip file I sent before trying to run it? I just tried it again and it worked (and the error logs were printed)

bazel run //software/thunderscope:thunderscope_main -- --blue_log ~/Downloads/proto_2024_06_13_203104

nimazareian commented 3 weeks ago

Overall looks good, just wondering why we get those logs spammed in replays. Could you also try running the formatting_scripts/fix_formatting.sh script once you're done making your changes? Hopefully that should help CI pass.

Mr-Anyone commented 5 days ago

Update

I've fixed formatting and other stuff.

Mr-Anyone commented 5 days ago

Also, should we merge this in before #3239 ?