NextCenturyCorporation / MCS

Repo for the Machine Common Sense project
https://nextcenturycorporation.github.io/MCS/
Apache License 2.0

Depth and RGB Frame misaligned from simulator rendering. #228

Closed: jaypatravali closed this issue 2 years ago

jaypatravali commented 3 years ago

Hello, I noticed a problem where, when running the simulator on level1/level2, the RGB and depth images at frame_i don't seem to be aligned. This gives us bad results because we rely on aligned image data. The link below has the input RGB and depth stored as image files. Below is frame 4 for depth and RGB as it goes to our vision modules. I have attached a zip file with the RGB/depth images generated from the simulator for a complete scene, along with the JSON file used to render it. I have tried many other scenes downloaded from the eval3 submission webpage, which also show this problem. An offset of +1 on the depth frame seems to fix it. (Attached images: depth-04, original-04)
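
For illustration, the workaround amounts to re-pairing the saved frames with the depth index shifted forward by one, roughly like this (a sketch only; the file names and frame numbering are assumptions based on the attached images):

from pathlib import Path

from PIL import Image

# Hypothetical directory holding the extracted frames from the attached zip.
frame_dir = Path("scene_frames")

pairs = []
for rgb_path in sorted(frame_dir.glob("original-*.png")):
    frame_idx = int(rgb_path.stem.split("-")[1])
    # Workaround: pair RGB frame i with depth frame i + 1.
    depth_path = frame_dir / f"depth-{frame_idx + 1:02d}.png"
    if depth_path.exists():
        pairs.append((Image.open(rgb_path), Image.open(depth_path)))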

JSON used to render this scene: https://oregonstate.box.com/s/qis6xyc2vmmgavavb1jkdapwdr0yanvx

This is also cross-posted on the slack: https://machinecommonsense.slack.com/archives/CTR31V12M/p1612298520083300.

bzinberg commented 3 years ago

Wall of text incoming: pasting lots more context from the past several months.

bzinberg commented 3 years ago

@bzinberg 2020-Nov-22

Bug report: Simulator RGB output is off by one

Summary: The RGB image in StepMetadata.image_list is one frame behind. The GUI and the oracle percepts are not. In particular:

  1. The RGB image does not match the oracle percepts.
  2. For passive scenes, there is no way to access the RGB image of the final frame.
  3. The RGB image for the initial frame is nonsense, so there is no reasonable way to choose an action other than Pass. Thus, I would like your help confirming whether this issue also exists in the interactive scenes.

To reproduce: With this directory as your working directory, invoke the repro.py script, specifying the path to your Unity executable:

python repro.py --unity_app_file_path=/path/to/MCS-AI2-THOR-Unity-App-v0.3.3.x86_64

When prompted at the terminal, take screenshots of the Unity GUI window. These screenshots, in comparison to the output RGB images, show the issue.

Attachment: 20201122_off_by_one.tar.gz
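
For anyone who can't open the tarball, the core of the check is roughly the following (a sketch only, not the attached repro.py; create_controller, start_scene, and image_list are the MCS API names used elsewhere in this thread, but the exact create_controller arguments and the assumption that image_list holds PIL Images are mine):

import json

import machine_common_sense as mcs

# The unity_app_file_path argument mirrors the repro.py flag above (assumption).
controller = mcs.create_controller(
    unity_app_file_path="/path/to/MCS-AI2-THOR-Unity-App-v0.3.3.x86_64")

with open("scene.json") as f:  # hypothetical passive scene file
    scene = json.load(f)

# Save the RGB image reported for each step; compare these PNGs against
# screenshots of the Unity GUI taken at the same steps.
output = controller.start_scene(scene)
output.image_list[-1].save("rgb_initial.png")

for i in range(5):
    output = controller.step("Pass")
    output.image_list[-1].save(f"rgb_{i:02d}.png")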

bzinberg commented 3 years ago

@bzinberg 2020-Nov-22 Whoops, I forgot to un-hardcode the output directory. In repro.py, do a global find-and-replace of /home/bzinberg/probcomp/mcs_bug_reports/20201122_off_by_one with .

bzinberg commented 3 years ago

@ThomasSchellenbergNextCentury 2020-Nov-24

@bzinberg So @brianpippin was looking into this issue yesterday, and neither of us can reproduce it. We both ran your repro script and got the correct image results (not the same as your results). I've attached the images I generated from your repro script. Is it possible that you don't have the correct version of the MCS python library installed? (Three images attached.)

bzinberg commented 3 years ago

@bzinberg

I'm working off a local clone at commit 2e7122, pip says that's version 0.3.3. AFAICS, that commit hash is still the latest in upstream master.

I assumed that installing from master of a local clone would give me something at least as recent as the release, and I vaguely remember from previous evals that the newer version was sometimes needed. Is that wrong? And what commit hash were you @ThomasSchellenbergNextCentury and @brianpippin working from?

bzinberg commented 3 years ago

@ThomasSchellenbergNextCentury

Hmm, that sounds right to me. I did a clean checkout and pip install of the MCS master branch to test your repro script. Commit hash 2e71220d5949bc7620936579c2b20ce5c11f2eab

bzinberg commented 3 years ago

@bzinberg ok, I'll add to my todo list to retry this on a(nother) fresh environment. BTW, is it possible that this is a platform-specific issue? (Ubuntu 18.04, Python 3.6.8 virtualenv, ...)

@ThomasSchellenbergNextCentury Maybe, but seems unlikely. I'm on Ubuntu 20.04, Python 3.8.5, and not using a virtual env. Brian is on Mac.

bzinberg commented 3 years ago

@bzinberg Reproduced the issue again. Full shell transcript and the output PNGs attached.

bzinberg commented 3 years ago

@bzinberg Might the simulator's behavior depend on the underlying video hardware? My first guess (definitely not an expert) would be that somewhere under the hood a function is being called to retrieve an RGB image from the GPU and its output is hardware-dependent.

➜  sudo lshw -C video
  *-display
       description: VGA compatible controller
       product: UHD Graphics 620
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       logical name: /dev/fb0
       version: 07
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=i915 latency=0 mode=2560x1440 visual=truecolor xres=2560 yres=1440
       resources: iomemory:2f0-2ef iomemory:2f0-2ef irq:126 memory:2ffa000000-2ffaffffff memory:2fa0000000-2fafffffff ioport:e000(size=64) memory:c0000-dffff

@ThomasSchellenbergNextCentury That seems as good a guess as any. @Steven Borkman since you're our local Unity expert, do you have any thoughts about Ben's issue (see above)?

@Steven Borkman So, I would have to dig deeper into the specifics of how ai2thor is reading the info off of the GPU. But there are timing issues that can come from using certain GPU readback approaches, like Texture2D.ReadPixels

Internally in our perception package, which is our AI team’s approach to generating synthetic data for CV model training, we use https://docs.unity3d.com/ScriptReference/Rendering.CommandBuffer.RequestAsyncReadback.html where possible

and put the readback code in https://docs.unity3d.com/ScriptReference/Rendering.RenderPipelineManager-endCameraRendering.html

These require the use of scriptable render pipelines (SRPs) like URP or HDRP (Unity supported pipelines that a user can use out of the box)

Using CommandBuffers makes it so that the readback is enqueued in the list of graphics commands, so it is guaranteed to happen in the same place relative to the other rendering, whereas something like ReadPixels will sometimes do the readback immediately, even before the other rendering has finished

a tool like RenderDoc is really useful for diagnosing issues like these.

https://renderdoc.org/

RenderDoc is only supported on Linux and Windows unfortunately for Mac users like me

@ThomasSchellenbergNextCentury There's a similar issue posted on the original AI2-THOR GitHub, but without any conclusions. https://github.com/allenai/ai2thor/issues/383

bzinberg commented 3 years ago

@Alan Fern Thanks for working on this. For some context, this "discovery" helps explain some issues we were having with consistency even within our team. For example, it turns out our perception networks were trained without the inconsistent depth, while other development that used those models was sometimes with and sometimes without it.

@bzinberg (Note that these issues will probably also affect data generators!)

bzinberg commented 3 years ago

@ThomasSchellenbergNextCentury @bzinberg @jaypatravali @Alan Fern What OS and hardware are you using when this issue occurs?

bzinberg commented 3 years ago

My OS and hardware:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:    18.04
Codename:   bionic

$ uname -a
Linux bzinberg-probcomp 5.0.0-43-generic #47~18.04.1-Ubuntu SMP Mon Mar 2 04:28:21 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

$ sudo lshw -C video
  *-display                 
       description: VGA compatible controller
       product: UHD Graphics 620
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       logical name: /dev/fb0
       version: 07
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=i915 latency=0 mode=1920x1080 visual=truecolor xres=1920 yres=1080
       resources: iomemory:2f0-2ef iomemory:2f0-2ef irq:125 memory:2ffa000000-2ffaffffff memory:2fa0000000-2fafffffff ioport:e000(size=64) memory:c0000-dffff
jaypatravali commented 3 years ago

@bzinberg @ThomasSchellenbergNextCentury @Alan Fern

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:    18.04
Codename:   bionic

Linux vayuyaan 5.4.0-65-generic #73~18.04.1-Ubuntu SMP Tue Jan 19 09:02:24 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

(base) jay@vayuyaan:~$ sudo lshw -C video
  *-display                 
       description: 3D controller
       product: GM107M [GeForce GTX 960M]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list
       configuration: driver=nouveau latency=0
       resources: irq:133 memory:93000000-93ffffff memory:80000000-8fffffff memory:90000000-91ffffff ioport:4000(size=128)
  *-display
       description: VGA compatible controller
       product: HD Graphics 530
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 06
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:132 memory:92000000-92ffffff memory:a0000000-afffffff ioport:5000(size=64) memory:c0000-dffff
ThomasSchellenbergNextCentury commented 3 years ago

Thank you all for the info. Our team has been investigating this issue and is still having trouble reproducing it. @erick-u, a Unity employee and member of the MCS TA2 team, has reached out to his colleagues at Unity for support. If anyone would know what's causing this issue, it's Unity!

Additionally, we're in the middle of updating our fork of the AI2-THOR Python and Unity codebases to their latest versions in preparation for Eval 4 training. Maybe we'll get lucky and that will solve this issue for us.

Worst case, we're still moving forward with our plan to give each TA1 team access to their own EC2 instance on our AWS account. Since these machines will have the same OS and hardware that we're using to run the evaluations, TA1 and TA2 will be able to work together to address any lingering issues. I know this doesn't help you for training on your internal environments, so we won't be abandoning this issue entirely.

Please let us know if you have any questions or comments about our progress.

bzinberg commented 3 years ago

Great, thanks for your work handling this on multiple fronts, and for the updates!

bzinberg commented 3 years ago

@erick-u, any update on this?

dcharrezt commented 3 years ago

Hi, this is happening on my end too for the interactive task scenes.

My current hardware and OS

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:    20.04
Codename:   focal

$uname -a
     Linux ct 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$sudo lshw -C video
*-display                 
       description: VGA compatible controller
       product: Haswell-ULT Integrated Graphics Controller
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 09
       width: 64 bits
       clock: 33MHz
       capabilities: msi pm vga_controller bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:48 memory:b2000000-b23fffff memory:a0000000-afffffff ioport:5000(size=64) memory:c0000-dffff
erick-u commented 3 years ago

Hi,

Due to reproducibility issues, we are securing computers with exact specifications matching some of the reported hardware to validate the issue and fixes.

I believe the issue is due to each ReadPixels() call blocking while flushing the GPU, together with the write operations of the image generation, causing timing issues in the generated images, as Steven Borkman's investigations revealed.

We have several ideas lined up to solve this, but they require architecture improvements.

Currently, I have started the AI2-THOR upgrade, which improves upon the core architecture and image generation and allows us to update the version of Unity the project is using. This could solve the issue and/or open up more avenues for resolving it, including the possibility of converting to the scriptable render pipeline, using AsyncGPUReadback, and others.

We will follow up with updates as we move forward.

ThomasSchellenbergNextCentury commented 3 years ago

@jaypatravali @bzinberg @dcharrezt Did you all experience these errors while running scenes independently (one at a time), or while running a lot of scenes sequentially (with the same code) using the same MCS controller object?

bzinberg commented 3 years ago

For me, one at a time. The part of the codebase I was working on created a new controller every time a scene was run, and it saw a mismatch between depth image and ground-truth percepts (that code didn't touch RGB image, but as seen above RGB image was also off from ground truth).

jaypatravali commented 3 years ago

@ThomasSchellenbergNextCentury: I go through a lot of scenes; for me it's the same MCS controller object and the same code.

dcharrezt commented 3 years ago

For me it happens when running scenes independently.

deanwetherby commented 3 years ago

I ran @bzinberg 's script but did not see a difference between the images in the UI and the metadata. Here's my hardware:

~ $ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:    20.04
Codename:   focal
~ $ uname -a
Linux dwetherby-Alienware-15-R3 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
~ $ sudo lshw -C video
  *-display                 
       description: VGA compatible controller
       product: GP104BM [GeForce GTX 1080 Mobile]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:158 memory:dc000000-dcffffff memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:e000(size=128) memory:c0000-dffff
  *-display
       description: Display controller
       product: HD Graphics 630
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 04
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm bus_master cap_list
       configuration: driver=i915 latency=0
       resources: irq:150 memory:db000000-dbffffff memory:70000000-7fffffff ioport:f000(size=64)
deanwetherby commented 3 years ago

I added cv2.imshow for the visual, depth, and segmentation masks. All image types were available on the first and last frames for my hardware and they appeared to be synchronous.

MicrosoftTeams-image

bzinberg commented 3 years ago

Attaching an updated version of the repro script in https://github.com/NextCenturyCorporation/MCS/issues/228#issuecomment-774273198 that uses the new MCS controller API. Confirmed just now that I'm still seeing the issue in MCS 0.3.8, Unity app version 0.3.8. My OS and hardware are as above, and more info about my Python environment is attached below as python_environment.txt.

Attachment 1: 20210226_updated_off_by_one.zip Attachment 2: python_environment.txt

ThomasSchellenbergNextCentury commented 3 years ago

Based on Ben's latest set of output images, since his initial RGB image is what you see between starting Unity (with create_controller) and initializing the scene (with start_scene), I'm guessing there's some lag happening between those two parts.

Here's the expected workflow:

  1. We create an AI2-THOR Controller object from the MCS Python API, which starts Unity: https://github.com/NextCenturyCorporation/ai2thor/blob/development/ai2thor/controller.py#L416-L426
  2. AI2-THOR automatically submits a Reset action (which sets the default floor and wall textures shown in Ben's initial RGB image): https://github.com/NextCenturyCorporation/ai2thor/blob/development/ai2thor/controller.py#L434-L442
  3. We submit an Initialize action from the MCS Python API, calling the Controller's step function with our JSON scene data: https://github.com/NextCenturyCorporation/ai2thor/blob/development/ai2thor/controller.py#L691
  4. It appears step should be waiting for the previous action to finish, but maybe it's not? https://github.com/NextCenturyCorporation/ai2thor/blob/development/ai2thor/controller.py#L736-L738

@deanwetherby @brianpippin Maybe the Initialize is finished before the Reset, so when the Reset is finished, it overrides part of the Initialize action's output data before returning to the caller? Or it overrides properties in the AI2-THOR output metadata object? (Reset may not return depth data, which is why the depth data isn't overridden.)

We're still investigating this issue on multiple fronts, and members of TA2 are reaching out to members of TA1 for their support (stay tuned).

deanwetherby commented 3 years ago

Could one of you who is having this issue perform an experiment for us? I'd like you to use vanilla ai2thor in a fresh python virtual environment to see if it exhibits the same issue. This will give us another data point for troubleshooting since we can't seem to replicate the issue.

1) Create a fresh virtual environment. This is one of many possible ways.

$ mkdir ai2thor-test && cd ai2thor-test
$ python3.6 -m venv --prompt ai2test venv
$ source venv/bin/activate
(ai2test) $ python -m pip install --upgrade pip setuptools wheel
(ai2test) $ python -m pip install ai2thor==2.2.0 Pillow
(ai2test) ai2thor-test $ python -m pip list
Package      Version
------------ ---------
ai2thor      2.2.0
certifi      2020.12.5
chardet      4.0.0
click        7.1.2
Flask        1.1.2
idna         2.10
itsdangerous 1.1.0
Jinja2       2.11.3
MarkupSafe   1.1.1
msgpack      1.0.2
numpy        1.19.5
Pillow       8.1.0
pip          21.0.1
progressbar2 3.53.1
python-utils 2.5.6
PyYAML       5.4.1
requests     2.25.1
setuptools   54.0.0
six          1.15.0
urllib3      1.26.3
Werkzeug     1.0.1
wheel        0.36.2

2) Create and run the following script. The first time you run the ai2thor controller, it will download a Unity runtime.

(Example download output: thor-202001071627-Linux64: [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  25.3 MiB/s]  of 524 MB)

import time

from PIL import Image
from ai2thor.controller import Controller

controller = Controller(
        width=600,
        height=400,
        renderDepthImage=True,
        renderInstanceSegmentation=True)

event = controller.step("MoveAhead")

img = Image.fromarray(event.frame)
img.show()

time.sleep(3)

controller.stop()

The script pauses for three seconds so you can compare the event.frame with the Unity window.

Unity window Screenshot from 2021-03-01 12-17-57

Event Frame Screenshot from 2021-03-01 12-17-59

bzinberg commented 3 years ago

@deanwetherby, here are my results. The GUI screenshot was taken after the camera visibly moved forward in the scene. There does appear to be a timing difference between the two.

Event frame: event_frame GUI screenshot: gui_screenshot Text attachments: transcript.txt 1614635148_piplist.txt

deanwetherby commented 3 years ago

@bzinberg My guess is there's some underlying ai2thor problem with initialization. I wonder what the state of the depth map and segmentation mask is. Thanks for the additional data point with this test. Will follow up shortly.

deanwetherby commented 3 years ago

@bzinberg If you wouldn't mind, could you run this modified script to see what the side-by-side of the visual RGB and segmentation mask looks like?

import time
import numpy as np

from PIL import Image
from ai2thor.controller import Controller

controller = Controller(
        width=600,
        height=400,
        renderDepthImage=True,
        renderObjectImage=True)

event = controller.step("MoveAhead")

combined = Image.fromarray(
        np.hstack((event.frame, event.instance_segmentation_frame))
        )
combined.show()

time.sleep(3)

controller.stop()                    

I got this: Screenshot from 2021-03-01 19-43-31

bzinberg commented 3 years ago

@deanwetherby Oh wow, they do seem to be inconsistent. combined

jaypatravali commented 3 years ago

This is what I have on my end. Screenshot from 2021-03-01 19-22-20 Screenshot from 2021-03-01 19-20-22

deanwetherby commented 3 years ago

Now we have something to go on. Thanks @bzinberg and @jaypatravali.

ThomasSchellenbergNextCentury commented 3 years ago

@bzinberg @jaypatravali and anyone else who wants to help: Would you mind trying the above again, but using AI2-THOR version 2.5.0?

  1. Create a new python virtual environment, like Dean described in his first comment yesterday, but pip install ai2thor==2.5.0 rather than 2.2.0
  2. Create and run a script that shows the RGB image and object mask side-by-side, like Dean described in his other comment yesterday, but replace "MoveAhead" with "LookDown" (note that the default scene is different in this version); see the sketch below.
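
For reference, the modified script would look roughly like this (a sketch based on Dean's side-by-side script above; if 2.5.0 renamed any of these Controller flags, adjust accordingly):

import time
import numpy as np

from PIL import Image
from ai2thor.controller import Controller

controller = Controller(
        width=600,
        height=400,
        renderDepthImage=True,
        renderObjectImage=True)

# The default scene differs in 2.5.0, so use "LookDown" instead of "MoveAhead".
event = controller.step("LookDown")

combined = Image.fromarray(
        np.hstack((event.frame, event.instance_segmentation_frame)))
combined.show()

time.sleep(3)
controller.stop()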

This is my result:

ai2thor-2 5

Thanks!

jaypatravali commented 3 years ago

@ThomasSchellenbergNextCentury @bzinberg Screenshot from 2021-03-02 12-14-58 still looks off

bzinberg commented 3 years ago

still looks off

Same here. combined

deanwetherby commented 3 years ago

I noticed in @jaypatravali's hardware description that the GPU was using the nouveau open source drivers instead of nvidia. I switched to the open source version, rebooted, and tried the same steps from above. Results were the same for me, so I'm ruling out drivers as a possibility.

~ $ sudo lshw -C video
  *-display                 
       description: VGA compatible controller
       product: GP104BM [GeForce GTX 1080 Mobile]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nouveau latency=0
       resources: irq:152 memory:dc000000-dcffffff memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:e000(size=128) memory:c0000-dffff
  *-display
       description: Display controller
       product: HD Graphics 630
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 04
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm bus_master cap_list
       configuration: driver=i915 latency=0
       resources: irq:150 memory:db000000-dbffffff memory:70000000-7fffffff ioport:f000(size=64)

(image attached)

deanwetherby commented 3 years ago

Another test we can do is to look at the rendered images right after controller initialization rather than relying on taking a 'LookDown' or 'MoveAhead' action. Ai2thor stores event information in the last_event attribute even if it doesn't return it explicitly. Could either @bzinberg, @jaypatravali, or anyone else experiencing this issue run this additional test in the ai2thor vanilla environment? Thank you for the continued support on these experiments by the way.

import time
import numpy as np

from PIL import Image
from ai2thor.controller import Controller

controller = Controller(
        width=600,
        height=400,
        renderDepthImage=True,
        renderObjectImage=True)

combined = Image.fromarray(
        np.hstack((
            controller.last_event.frame,
            controller.last_event.instance_segmentation_frame))
        )

combined.show()
time.sleep(3)
controller.stop()
bzinberg commented 3 years ago

Thank you for the continued support on these experiments by the way.

Of course!

Hm, not what I expected: combined

I was expecting to see a pool of liquid mercury as shown in an earlier rgb_initial.png (https://github.com/NextCenturyCorporation/MCS/issues/228#issuecomment-774273198): image

Ai2thor stores event information in the last_event attribute even if it doesn't return it explicitly.

Are we certain that the delay doesn't somehow get introduced when storing and retrieving that field? I hope not, but a certain amount of paranoia may be useful.
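
For what it's worth, one quick way to probe that particular worry with the vanilla ai2thor calls already used above would be a sketch like this, comparing the event returned by step() against controller.last_event right afterward:

import numpy as np

from ai2thor.controller import Controller

controller = Controller(
        width=600,
        height=400,
        renderDepthImage=True,
        renderObjectImage=True)

event = controller.step("MoveAhead")

# If last_event is simply the most recent event, these should match exactly;
# a mismatch would point at the storage/retrieval path instead.
print(np.array_equal(event.frame, controller.last_event.frame))

controller.stop()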

deanwetherby commented 3 years ago

I was fully expecting to see the pool of liquid mercury as well and this experiment was meant to verify that suspicion. This is a surprising result but does give us something new to look into. :+1:

ThomasSchellenbergNextCentury commented 3 years ago

@bzinberg Between the controller = Controller( line and the combined = Image.fromarray( line, can you please add controller.step("LookDown") and run it again?
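
In other words, the test script becomes the following (Dean's last_event script above, with the one added line):

import time
import numpy as np

from PIL import Image
from ai2thor.controller import Controller

controller = Controller(
        width=600,
        height=400,
        renderDepthImage=True,
        renderObjectImage=True)

controller.step("LookDown")  # the added line

combined = Image.fromarray(
        np.hstack((
            controller.last_event.frame,
            controller.last_event.instance_segmentation_frame)))

combined.show()
time.sleep(3)
controller.stop()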

(PS: It's a carpet, not liquid mercury!)

bzinberg commented 3 years ago

Well, well, well... combined

(PS: It's a carpet, not liquid mercury!)

:open_mouth:

deanwetherby commented 3 years ago

Although we will continue to look at this ai2thor issue, we do not have an expected time frame for resolution. I would recommend those experiencing this problem find workarounds for the pending gravity support evaluation on March 15. We believe the issue resides deep in the ai2thor Unity scene initialization but only for specific hardware (integrated graphics?). Despite our best efforts, none of the MCS core developers can replicate this issue which makes debugging purely speculative. Our MCS Unity devs are looking more closely at the initialization process but nothing has stood out to them so far. Note that we do not expect to run into this misalignment when running the evaluation on GPU EC2 instances. We'll update this issue when we know more.

bzinberg commented 3 years ago

Thanks @deanwetherby for the update. It's very helpful to know there is now less certainty about how long it will take to diagnose and fix this issue.

TLDR:

  1. There are important reasons EC2 instances are not a complete solution, and I'd like to make sure we're appropriately wary of relying on them as such.
  2. Given the current information, I think the best way forward is to declare an officially supported eGPU model which all TA1 teams can buy and use for development.

1. EC2 instances are not a complete solution

To develop their solutions, TA1 teams need to be able to develop locally. Using TA2-provided EC2 instances in the development loop would require a seamless integration for dialing into EC2 that "looks" exactly the same as a locally running instance of the simulator. AFAIK, this would be a huge infrastructure project that has quite a bit of implementation risk and would require the program to carve out a lot of staffing and extra room in the timeline to execute and maintain.

2. An officially supported eGPU model could be a good-enough solution

Much cheaper and (I think) less risky than the development work required to release and maintain an EC2-based development loop would be to declare one specific model of external GPU, confirmed to not have the frame delay issue on all major operating systems, officially supported, and simply have each TA1 team purchase such a GPU for everyone on their team who needs to be able to run the simulator (plus a couple of extras just in case). This would sidestep the above staffing and timeline problems, we could assess viability much more quickly, and if the initial assessment looks good I think it would require much less maintenance.

The Cora team already purchased an eGPU for a team member who was trying to contribute to Eval 3 near the submission deadline, but didn't end up using it in the end, instead relying on others running on their machines and reporting back. It is a GeForce GTX 1050Ti inside an external enclosure. I'm going to see if I can get my laptop set up with it and see whether that eliminates the issue. If so, perhaps we can make that the officially supported model.

deanwetherby commented 3 years ago

@bzinberg Thanks for the feedback and the additional information. I am very curious what you find out with eGPU.

To be clear, I was not proposing that TA2 be involved in any way in the TA1 development cycle.

Although TA2 EC2 instances would not be appropriate for TA1 development and model training, they are ideal for us to run evaluations. We understand that these machines are not "hardware identical" to your local environments. This is why it was such an important lesson learned in the last eval that we include TA1 in the build and verification process on our EC2 instances.

TA1 verifying and/or executing the build on a single representative EC2 instance will ensure the most replicable results possible between your local dev environments and our instances. From there our team can exactly replicate the verified build onto the 100s of machines needed to process all of the evaluation scenes.

Does this address your concerns?

bzinberg commented 3 years ago

Yes, perfect.

Agreed that moving the eval to EC2 instances and making those same instance types available to TA1 beforehand is a great step to detect and prevent the kind of discrepancies that came up last eval -- and I much appreciate the work and problem solving you and your team have done to put that into place.

Thanks for clarifying that the EC2 instances are not expected to also address the local development issues.

To highlight why the local development issues are so important: they include not just buggy behavior on an individual's machine, but also the inability to reproduce the same results among members of our team -- and for that reason they significantly impact the pace of progress we are able to make.

Very much hoping the eGPU solution pans out, will report back soon. Thanks @deanwetherby and team for your support on this.

erick-u commented 3 years ago

@bzinberg @jaypatravali

We have made this custom build to gather more information regarding this bug from the Unity side. We have synced render frames and encoding as well as adjusted initialization and would like feedback on this test. This could possibly fix the issue, or at the very least provide even more information as to what may be happening.

The following is the same repro environment @bzinberg created with the new development build. https://drive.google.com/drive/folders/1qo69tNLQ4nNzjhrgsoAZxk7pJEVwlkDl

The same reproduction steps should be followed:

To reproduce: With this directory as your working directory, invoke the repro.py script, specifying the path to your Unity executable:

python repro.py --unity_app_file_path=/path/to/MCS-AI2-THOR-Unity-App-v0.3.8.x86_64

When prompted at the terminal, take screenshots of the Unity GUI window.

Please confirm whether the same results are occurring, and attach the screenshots, rgb images, and rgb_unity images, as well as the generated log file, located here:

/home/YOURNAME/.config/unity3d/CACI with the Allen Institute for Artificial Intelligence/MCS-AI2-THOR/Player.log

erick-u commented 3 years ago

The latest version actually just finished uploading. If you pulled the build before this message, please download the latest.

bzinberg commented 3 years ago

Will do, @erick-u. Are you able to link to the source code / PR of the new binary?

erick-u commented 3 years ago

@bzinberg this is currently on a local branch with some hacks around it for testing. I will push it into a branch if you wish to see what we are attempting to do.