Mr-Anyone opened this issue 1 month ago
I would like to explain some design decisions I made with regard to replaying field test fixtures. Before I do so, please take a look at the following replay file for a field test:
This is the last 1000 lines
For context: the pastebin file is organized as timestamp||proto_class||message. The "====" is used to indicate a new message.
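For illustration, a line in that format could be split like this (a hypothetical sketch; the actual delimiter constant and unpacking logic live in the Thunderscope replay code):

```python
# Hypothetical sketch of splitting one log entry in the
# timestamp||proto_class||message format described above.
def parse_log_entry(line):
    timestamp, proto_class, message = line.split("||", maxsplit=2)
    return float(timestamp), proto_class, message

ts, cls, msg = parse_log_entry("1711235119.596471||TbotsProto.World||AAAA")
```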
If you look closely at the first 1000 lines, the timestamps (in this case 1711235119.596471) given in the replay file are all the same. Therefore, we cannot rely on that specific timestamp; we know for sure that it is not right. I therefore eliminated all of those protobufs.
However, protobufs with the "epoch_timestamp_seconds" field seem to yield the correct time. To play those protobufs, we re-index them so that the timestamps start at 0 rather than at some arbitrary unix timestamp.
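The re-indexing step can be sketched as follows (a minimal illustration; the tuple layout is an assumption, not the actual ProtoPlayer data structure):

```python
# Shift all timestamps so playback starts at 0 instead of an arbitrary
# unix epoch time. `entries` is assumed to be a list of
# (timestamp, proto_class, message) tuples sorted by timestamp.
def reindex_timestamps(entries):
    if not entries:
        return []
    start_time = entries[0][0]
    return [(t - start_time, cls, msg) for t, cls, msg in entries]

reindexed = reindex_timestamps(
    [(1711235119.5, "TbotsProto.World", "x"), (1711235121.0, "TbotsProto.World", "y")]
)
```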
Great find! I tried the broken replay files again and they work well now, though I am getting the following error when running the second replay (note that the replay still runs fine despite this error):
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/nima/.cache/bazel/_bazel_nima/d4d8c6a2a61c53de6e279fde073469b4/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 557, in __play_protobufs
) = ProtoPlayer.unpack_log_entry(
File "/home/nima/.cache/bazel/_bazel_nima/d4d8c6a2a61c53de6e279fde073469b4/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 293, in unpack_log_entry
proto = proto_class.FromString(base64.b64decode(data[len("b") : -len("\n")]))
google.protobuf.message.DecodeError: Error parsing message with type 'TbotsProto.World'
Quickly looking over your solution, I think it would work moving forward as well, but I think we should take a more proactive approach of fixing the root cause of the replay file lines being corrupt, rather than just patching the corruption. From what you've mentioned, the problem for the field test replays is related to the timestamp being the same throughout the file. When checking what that epoch time is, I get datetime.fromtimestamp(1711235119.596471)
-> datetime.datetime(2024, 3, 23, 16, 5, 19, 596471)
, which is the same day (though not the exact same time?) as when the first recording was taken. This likely shows that the protologger started logging with the initial time the field test had started at, and its time was never updated. ProtoLogger uses a time_provider callback function to get the time to use for each protobuf line. The time provider for the field test fixture is defined here:
https://github.com/UBC-Thunderbots/Software/blob/3266eb8bcad093d7a5ebd71cf0e1c3dbe924fd0b/src/software/field_tests/field_test_fixture.py#L101-L105
and the time value is updated based on the time of the latest vision packet: https://github.com/UBC-Thunderbots/Software/blob/3266eb8bcad093d7a5ebd71cf0e1c3dbe924fd0b/src/software/field_tests/field_test_fixture.py#L133-L138
Looking at the pastebin you've provided, it does seem like the vision packet time (t_capture) is being updated. This may indicate that self.ssl_wrapper_buffer isn't being properly updated with the latest wrapper packets, though the problem could be something else.
In any case, it may be better if we just update self.timestamp to be time.time() instead of t_capture (then we can probably remove ssl_wrapper_buffer as well). The main reason we have the time_provider callback function, as opposed to using time.time() directly in ProtoLogger, is to support simulated tests running as fast as possible depending on the computer they are running on (e.g. running faster in CI), while still having the replay files run in realtime.
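The time_provider pattern described above can be sketched roughly like this (a simplified illustration, not the actual ProtoLogger implementation; class and field names here are made up):

```python
import time

class LoggerSketch:
    # The logger takes a time_provider callback instead of calling
    # time.time() directly, so simulated tests can inject a fake clock
    # that advances as fast as the test runs, while field tests use
    # wall time.
    def __init__(self, time_provider=time.time):
        self.time_provider = time_provider
        self.entries = []

    def log(self, proto_full_name, payload):
        # Each entry is stamped with whatever clock was injected.
        self.entries.append((self.time_provider(), proto_full_name, payload))

# A simulated test could inject a fake clock that advances instantly:
sim_clock = iter(range(100))
logger = LoggerSketch(time_provider=lambda: float(next(sim_clock)))
logger.log("TbotsProto.World", b"...")
logger.log("TbotsProto.World", b"...")
```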
cc @PhilD71 as he's more familiar with the field test fixture and may have additional context.
Another useful update would be to have better error handling in or around ProtoPlayer.unpack_log_entry:
https://github.com/UBC-Thunderbots/Software/blob/1528173facd4b72d7fd4265e99da5ca9e26b1e3d/src/software/thunderscope/replay/proto_player.py#L128-L155
We should be able to detect if a log entry is corrupt (which could include the protobuf being corrupt so we're unable to deserialize it, or the timestamp being invalid) and skip over that log entry cleanly (and maybe just log that there was a corrupt line). Proper error handling may not be trivial given how the ProtoPlayer is currently written, but I think it's the right step forward for making our replays robust to more general errors than just checking that the timestamp is above some value.
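One way to sketch this kind of defensive unpacking (the delimiter and entry layout are assumptions here; the real unpack_log_entry also resolves the protobuf class from its name):

```python
import base64
import binascii

def try_unpack_log_entry(log_entry, delimiter="||"):
    """Return (timestamp, proto_class_name, raw_bytes), or None if the
    entry is corrupt in any of the ways we can detect cheaply."""
    try:
        ts_str, proto_class_name, payload = log_entry.split(delimiter, maxsplit=2)
        timestamp = float(ts_str)
        raw_bytes = base64.b64decode(payload.strip(), validate=True)
    except (ValueError, binascii.Error):
        # Missing delimiter, unparseable timestamp, or invalid base64:
        # the caller can log the corrupt line and skip it.
        return None
    if timestamp < 0:
        return None
    return timestamp, proto_class_name, raw_bytes
```

The caller would then skip entries that come back as None instead of crashing the playback thread.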
Ideally, with the bug that produced the broken replay files fixed, and a more robust ProtoPlayer, we should be able to replay all future replay files no matter where they were generated. That way, we shouldn't need specific code to check whether a replay file is from a field test, etc. Also, note that the goal of this ticket isn't just to get those replay files running, since we don't actually care about their content; the goal is to be more resilient to the same types of corruption that broke those files.
Also, don't forget to run ./formatting_script/fix_formatting.sh before you push.
I took a look at the broken field test fixture replay log and I'm not sure if it's incorrect. Looking at the timestamps, it is somewhat realistic to have multiple messages attached to the same timestamp. We get vision data every 0.5 seconds (30 Hz) and we expect the timestamp to update every 1 second or so. Our AI sends a ton of messages so it makes sense that a lot of messages would be attached to the same timestamp.
I think we need to get longer field_test logs to be able to really verify this correctly
> I took a look at the broken field test fixture replay log and I'm not sure if it's incorrect. Looking at the timestamps, it is somewhat realistic to have multiple messages attached to the same timestamp. We get vision data every 0.5 seconds (30 Hz) and we expect the timestamp to update every 1 second or so. Our AI sends a ton of messages so it makes sense that a lot of messages would be attached to the same timestamp.
> I think we need to get longer field_test logs to be able to really verify this correctly
I'm not sure if I fully follow your intuition here. We get vision at 60 Hz (I also confirmed this by looking at a second of the provided field test replay file, which had 62 SSL_WrapperPackets), which is one packet every ~0.016 seconds. I'm not sure why the timestamp would update every 1 second.
Though I wrote a script to read the replay file, similar to the one @Mr-Anyone likely used, and noticed that the timestamp is actually increasing as you mentioned, so the problem was not that. If you're curious, here's the script I used and the output I got. Vincent's pastebin wasn't actually the last 1000 lines of the replay; it was the last 1000 lines of the parsed replay, which included many new lines from printing the protobufs.
The reason the field test files fail to parse appears to be that, since we rely on the time from vision, the first few saved logs may not have a time from vision because we haven't received an SSL_WrapperPacket yet. As a result, the time used for the first few lines is 0 (which is much less than the epoch time), so the proto player thinks the replay has a length of the epoch (~54 years?!). Every now and then in the middle of the file, the wrapper packet's time is also 0 for some odd reason (proto.detection.t_capture=0.0). Overall, I think these problems will go away if we start using wall time (time.time(), or even better time.time_ns() for higher precision) instead of relying on the time from vision to organize our replays.
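A wall-time provider along those lines might look like the following sketch (where it gets wired into the fixture is left open; the helper name is made up):

```python
import time

def wall_time_seconds():
    # time.time_ns() returns an int count of nanoseconds since the epoch,
    # avoiding the float precision loss that time.time() has for large
    # unix timestamps; convert to seconds only at the end.
    return time.time_ns() / 1e9

t1 = wall_time_seconds()
t2 = wall_time_seconds()
```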
There was also a problem I ran into where the final line of the replay is empty, so we can't split it using our delimiter. Though that could be a bug in my script and how I'm reading the line. In either case, stuff like this is why I'm advocating for better exception handling in the proto player (e.g. wrapping the .split and the proto_class.FromString calls with try-except, though the challenge may be updating the code to handle scenarios where ProtoPlayer.unpack_log_entry doesn't return a value due to an exception).
You're right @nimazareian, I must've fried my brain yesterday after hiking. 60 Hz means a packet every ~0.02 s. In that case, these changes make sense.
That script is fairly useful, I think, for debugging proto logger issues. Maybe we should consider adding it to the repo.
Here is a script that generates log files with 10% corrupted log entries:
import random
import base64
import gzip
import os

from software.thunderscope.replay.replay_constants import *
from software.thunderscope.replay.proto_player import ProtoPlayer


def create_invalid_log_entries(proto, current_time, frequency=0.1):
    some_random_value_between_one_and_zero = random.random()
    serialized_proto = base64.b64encode(proto.SerializeToString())

    # corrupt some entries by dropping the first delimiter
    if some_random_value_between_one_and_zero < frequency / 2:
        log_entry = (
            f"{current_time}"
            + f"{proto.DESCRIPTOR.full_name}{REPLAY_METADATA_DELIMETER}"
            + f"{serialized_proto}\n"
        )
    # corrupt some entries by injecting invalid data
    elif some_random_value_between_one_and_zero < frequency:
        log_entry = (
            f"{current_time}{REPLAY_METADATA_DELIMETER}"
            + f"{proto.DESCRIPTOR.full_name}thisissomethingthatwouldbebad{REPLAY_METADATA_DELIMETER}"
            + f"{serialized_proto}thisisalsosomethingbad\n"
        )
    # otherwise, write a valid entry unchanged
    else:
        log_entry = (
            f"{current_time}{REPLAY_METADATA_DELIMETER}"
            + f"{proto.DESCRIPTOR.full_name}{REPLAY_METADATA_DELIMETER}"
            + f"{serialized_proto}\n"
        )
    return log_entry


if __name__ == "__main__":
    path_to_recover = input("path to valid protobufs to corrupt: ")
    save_path = input("path to save: ")
    os.makedirs(save_path, exist_ok=True)

    sorted_chunks = sorted(os.listdir(path_to_recover))
    count = 0
    for file in sorted_chunks:
        path_to_file = os.path.join(path_to_recover, file)
        print(f"Writing file: {path_to_file}")
        chunks = ProtoPlayer.load_replay_chunk(path_to_file)
        with gzip.open(os.path.join(save_path, f"{count}.replay"), "wb") as log_file:
            for log_entry in chunks:
                timestamp, proto_class, message = ProtoPlayer.unpack_log_entry(log_entry)
                corrupted_entry = create_invalid_log_entries(message, timestamp)
                log_file.write(bytes(corrupted_entry, encoding="utf-8"))
        count += 1
This produces output where 10 percent of the log entries are invalid. Please note that the program above assumes all entries in the input replay file are valid before use. Run Thunderscope with the zip file attached above.
> That script is fairly useful I think for debugging proto logger issues. Maybe we should consider adding it onto the repo
That's a good idea. @Mr-Anyone I've cleaned up the script and put it in this branch. Could you include it in your PR: https://github.com/nimazareian/Software/commit/17bd68131cd3402d50ee32665fb525e03be44fc8
Note that for some reason I ran into a weird error when trying to format this script using fix_formatting.sh, which may cause issues in CI, though it might've just been an error on my end.
It's great that you were able to test with a corrupted replay file and make the proto player more robust! I'm not sure how difficult it is, but if it's straightforward, could you add a pytest for the proto player where it opens corrupted replay files and tests that the code doesn't throw any exceptions? It'd be great to have some sort of test for it in CI so we can make sure we don't break it in the future (if it's not too much work and doesn't require a large refactor of ProtoPlayer). If you do end up doing this, there could be multiple tests, with one or more for each type of corruption (the replay files could be made within the test using your script).
I am not quite sure how I am going to add a pytest to ensure the proto player doesn't throw an exception without running it. Most of the time, the exception comes with the following lines:
Since almost all exceptions inherit from the Exception class, the following line would handle almost all errors: ValueError, IndexError, IOError, etc.
You can add a new class member to ProtoPlayer called self.has_parsed_successfully, which defaults to True. It gets set to False if we catch an error in the try-except. Then, add a getter ProtoPlayer.has_parsed_successfully that returns self.has_parsed_successfully, and assert using this getter in the pytest.
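The suggested pattern can be sketched on a minimal stand-in class (illustrative only; the real ProtoPlayer constructor, delimiter, and parsing differ):

```python
class ReplayParserSketch:
    """Minimal sketch of the has_parsed_successfully flag pattern."""

    def __init__(self):
        # Assume success until we hit a corrupt entry.
        self.has_parsed_successfully = True

    def parse(self, entries):
        results = []
        for entry in entries:
            try:
                ts_str, proto_class, payload = entry.split("||", maxsplit=2)
                results.append((float(ts_str), proto_class, payload))
            except ValueError:
                # Corrupt entry: record the failure but keep going.
                self.has_parsed_successfully = False
        return results

# A pytest would corrupt a replay, run the parser, and assert on the flag:
parser = ReplayParserSketch()
parser.parse(["1.0||TbotsProto.World||x", "thisline_is_corrupt"])
```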
I've added a pytest. Do you want this test to be added in the CI?
> I've added a pytest. Do you want this test to be added in the CI?
All tests added to the Bazel BUILD file will automatically run. It seems like the test is actually failing right now in CI.
That makes sense if it's intended, though it is weird that we get those messages logged even for normal replays taken from AI vs AI. Here's the replay I got. Could you see why it's logging those messages? Maybe we're not importing a specific protobuf type, so it can't be parsed in the replay. proto_2024_06_13_203104.zip
I think there is actually a file corruption going on. Here is the full backtrace I got when parsing your zip file:
Traceback (most recent call last):
File "/home/vhe/.cache/bazel/_bazel_vhe/917dd48f905a75cc113dac63db2f3aa5/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 123, in is_log_entry_corrupt
_ = ProtoPlayer.unpack_log_entry(log_entry)
File "/home/vhe/.cache/bazel/_bazel_vhe/917dd48f905a75cc113dac63db2f3aa5/execroot/__main__/bazel-out/k8-fastbuild/bin/software/thunderscope/thunderscope_main.runfiles/__main__/software/thunderscope/replay/proto_player.py", line 220, in unpack_log_entry
proto = proto_class.FromString(base64.b64decode(data[len("b") : -len("\n")]))
File "/usr/lib/python3.8/base64.py", line 87, in b64decode
return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (6373) cannot be 1 more than a multiple of 4
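That binascii error comes from base64's length rule: valid base64 data has a length that is a multiple of 4 (a length of 4k+1 is impossible, which is exactly what the message above complains about). A cheap pre-check, for illustration:

```python
import base64
import binascii

def is_valid_base64(payload):
    # b64decode raises binascii.Error for invalid lengths or, with
    # validate=True, for non-alphabet characters in the payload.
    try:
        base64.b64decode(payload, validate=True)
        return True
    except binascii.Error:
        return False
```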
On my end, a newly created zipfile works perfectly. Here is one if you are curious: working_proto.zip
> I think there is actually a file corruption going on. Here is the full stackframe/backtrace I've got when parsing your zip file:
Have you unzipped the zip file I sent before trying to run it? I just tried it again and it worked (and the error logs were printed)
bazel run //software/thunderscope:thunderscope_main -- --blue_log ~/Downloads/proto_2024_06_13_203104
Overall looks good, just wondering why we get those logs spammed in replays. Could you also try running the formatting_scripts/fix_formatting.sh script once you're done making your changes? Hopefully that should help CI pass.
I've fixed formatting and other stuff.
Also, should we merge this in before #3239 ?
Description
resolves #3154
I implemented essentially two features:
Testing Done
Ran the following script. The two zipfiles came from the issues. Check out this branch and run the following four commands.
Resolved Issues
resolves #3154
Length Justification and Key Files to Review
N/A
Review Checklist
It is the reviewer's responsibility to also make sure every item here has been covered.
- Functions (especially those declared in a .h file) should have a javadoc style comment at the start of them. For examples, see the functions defined in thunderbots/software/geom. Similarly, all classes should have an associated Javadoc comment explaining the purpose of the class.
- TODO (or similar) statements should either be completed or associated with a github issue