Ah we haven't yet implemented those in the Python API. They are only implemented in Swift currently. But hopefully the description of the reward functions and reward schedules are sufficient to be able to reimplement them? If anything is unclear or ambiguous, please let us know and we'll be happy to clarify.
Thanks! Can you please clarify how the reward rate is computed by providing the formula? Also, can you briefly tell me why it is a good measure to plot? It wasn't too clear from the paper.
Sure, let's say r(t) is the reward obtained at time step t. The reward rate at time t, with respect to a time window T, is defined as \sum_{s=t-T+1}^{t} r(s) / T, assuming t >= T. (If t < T, then the reward rate is just \sum_{s=1}^{t} r(s) / t.)
Note that the reward rate for non-time-varying agents, such as the greedy agents, is constant over time. To compute those, we just ran the agent for a very long time and divided the total reward by the total time.
We chose this metric because simply measuring total reward would not capture how the agent's ability to find reward changes over time. We chose a sliding window since the agent's ability to find reward can change even after a lot of time has passed: for example, if the agent collects all the jellybeans in a large region and "exhausts" the available reward, this metric would show a decline after a period of time. In addition, it's easy to see with this metric if the agent becomes "stuck" and unable to collect any additional reward. Also, the spatial density of jellybeans in a region provides a simple (but possibly crude) upper bound on the reward rate.
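For concreteness, here's a minimal Python sketch of this sliding-window computation (the function name and the assumption that you have the per-step rewards in a list are just for illustration; `T = 100000` matches the window we used for most experiments, mentioned below):

```python
from collections import deque

def sliding_reward_rate(rewards, T=100000):
    """Reward rate at each step: sum of the most recent T rewards divided
    by T (or by the number of steps so far, while fewer than T steps have
    elapsed), following the formula above."""
    window = deque()
    window_sum = 0.0
    rates = []
    for r in rewards:
        window.append(r)
        window_sum += r
        if len(window) > T:
            window_sum -= window.popleft()
        rates.append(window_sum / len(window))
    return rates
```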
@asaparov Thank you very much. These comments are very helpful. I have a few follow up questions, if you don't mind.
> it's easy to see with this metric if the agent becomes "stuck" and unable to collect any additional reward.
Can you please expand on how we'd know if the agent is "stuck"? If the reward rate has plateaued, couldn't that also mean that the agent is behaving optimally?
> the spatial density of jellybeans in a region provides a simple (but possibly crude) upper bound on the reward rate.
Can you please expand more on this? If I'm not wrong, the sparsity of an item is controlled by the intensity argument, which (roughly) tells how densely the items appear in a patch. So, how can we use this quantity to get an upper bound on the reward rate?
Additional questions:
Using the environment configuration described in Table 3 (setting the intensity arguments to a positive value) leads to a super dense environment, i.e., items appear everywhere in the grid. Was this a design choice? In my experiments, setting the intensity argument to a negative value, say -4.5 or -5.5, seems to produce environments that have a balance between sparse and dense rewards. Am I misunderstanding something?
Is there an intuitive rule for setting the value of the intensity function argument? For instance, if I want 5-6 items to appear in every 11x11 subgrid, what value should I use?
I really appreciate you taking the time to answer my questions :)
Sure no problem!
Ah, by "stuck", I meant that the agent stops collecting any rewards, for example if it starts going in a loop. The reward rate would then go to zero. This actually happened in some of our experiments. However, if we instead used an infinite window (T -> inf in the above equation), it would not be so apparent that the agent has become stuck. This is why we used a window of 100,000 (for most of our experiments).
Yes, if the interaction functions are zero, then the intensity function is directly related to the density of the items. In fact, this is just a Poisson point process, and the probability of an item existing at any given position (x, y) is given by exp(\lambda), where \lambda is the intensity. This also explains why setting \lambda to a positive value results in items appearing everywhere, since in this case, exp(\lambda) >= 1. This also gives you a rough idea of how to set the intensity for a desired density. If you want 6 items to appear in an 11x11 subgrid, you want the probability of an item to be 6/121 = exp(\lambda), and so you can set \lambda to log(6/121) ~= -3.
But if the interaction functions are not zero, this simple rule may no longer be true. And in this case, the intensity may no longer be as directly related to the spatial density of items, and therefore, it may no longer be as good of an upper bound on the reward, but I agree that it could still serve as a good upper bound so long as your interaction functions aren't too crazy.
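To make the arithmetic explicit, here's a tiny sketch of that rule of thumb (only meaningful when the interaction functions are zero, as noted above; the function name is just for illustration):

```python
import math

def intensity_for_density(items_per_region, region_area):
    """Constant (log-)intensity lambda such that exp(lambda) equals the
    desired per-cell probability of an item, assuming zero interaction
    functions (i.e. a plain Poisson point process)."""
    return math.log(items_per_region / region_area)

# e.g. ~6 items per 11x11 subgrid:
print(intensity_for_density(6, 11 * 11))  # ~ -3.0
```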
@asaparov Thank you very much.
I have one final question about the code: I'm using the Python API and I want to modify the vision part of the agent state. For instance, I want to represent an item's approximate location instead of its exact location. How/where do I modify it?
Do you mean compute the vision as if the positions of the items were slightly different from their true positions? Or are you more generally interested in adding noise to the visual field? For the former, the agent's visual field is computed in the C++ code here. This function updates the agent's scent and vision buffers by iterating over all the nearby items and agents, and adding their contributions to both the scent and vision. The loop over items is on line 846. Lines 861 and 862 are where their vision contribution is computed. You could add noise to `item.position` before line 861 here, but be careful not to actually write to `item`, since it is a reference to the actual underlying item object, so changing its position would change the position of the item itself. Instead, you should make a copy and only modify the copy.
Another fully Python option is to use the `simulator._map` method to get a list of all the items within a certain bounding box, and then manually compute a vision vector (mirroring the C++ vision computation). As an example, the visualizer uses this `_map` function here.
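To illustrate that second (Python-only) route, here's a rough sketch of what the manual vision computation could look like. This is not the library's actual rendering code: `render_vision`, its arguments, and the `(position, color)` item tuples are my own illustration, and you would still have to build that item list yourself from whatever `simulator._map` returns for a bounding box around the agent. Occlusion and FOV are ignored here.

```python
import numpy as np

def render_vision(agent_pos, items, vision_range):
    """Render a (2R+1, 2R+1, 3) RGB visual field centered on the agent.
    `items` is a list of ((x, y), (r, g, b)) pairs gathered from the map.
    Occlusion/FOV are ignored in this sketch."""
    R = vision_range
    vision = np.zeros((2 * R + 1, 2 * R + 1, 3), dtype=np.float32)
    ax, ay = agent_pos
    for (ix, iy), color in items:
        dx, dy = ix - ax, iy - ay
        if -R <= dx <= R and -R <= dy <= R:
            # To perturb the perceived location, modify (dx, dy) here
            # rather than the underlying item object.
            vision[dy + R, dx + R] += np.asarray(color, dtype=np.float32)
    return vision
```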
I am interested in representing the visual field similarly to the scent field. I want to add diffusion/decay to the visual field, but with a different diffusion parameter. Do you think the above suggestions are valid for my setting?
I will look into both code paths.
Diffusion to the visual field? Interesting. Could you define it a bit more precisely? Perhaps as a formula? I have a number of ideas about what you could be referring to but I'm not sure which is the one you have in mind. For example, does the visual field of the agent at time t depend only on the positions and types of the items at time t, or does it also depend on the visual field at time t - 1?
> Perhaps as a formula?
If an object is present at location (x, y), then it will have an exponentially decaying influence over the entire grid. A low value of influence, say < 0.05, could perhaps be ignored (a threshold). The (R, G, B) values of any cell (pixel) are the sum of the influences of all the items in the grid on that cell.
Visual occlusion and viewing angles remain valid, but the agent can't "see" the exact location of the object; rather, it sees a mixed-up, diffused version, like a heat map.
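As a rough formula (with s a diffusion/decay constant and tau the threshold, both names of my own choosing): V_c(x, y) = \sum_i C_c(i) * exp(-||(x, y) - p_i|| / s), where the sum runs over all items i, p_i is item i's position, and C_c(i) is its color in channel c (R, G, or B); terms smaller than tau (say 0.05) are dropped.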
> does the visual field of the agent at time t depend only on the positions and types of the items at time t, or does it also depend on the visual field at time t - 1?
In the most general case it could depend on previous time steps, but I'm interested in the case where the dependence is on the current time step only. That is, the visual diffusion only depends on the objects in the map at the current time step.
Let me know if you need more clarification on anything.
I think I see what you mean. So would that mean that an item that is normally outside the visual field of the agent would be slightly visible? (since its color could diffuse into the agent's visual field if it's close enough) Also, would the occlusion/FOV be computed before or after this diffusion/blurring step? Or are you disabling occlusion/FOV for now? I'm assuming you want to use the same diffusion model as scent, except with different constants for decay/diffusion?
You could implement this using either approach I mentioned above (in Python or C++). The C++ code has a class called `diffusion` which would help here. You can see how this class is used in the `update_state` function I mentioned earlier (it's passed as an argument called `scent_model`). This function only computes the scent for one pixel (the agent's position), but in your case, you need to compute it for every pixel in the agent's visual field, and so you need a loop over pixels and a nested loop over nearby items. You can also use the `diffusion` class as a reference to implement the same thing in Python.
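To make the double loop concrete, here's a rough Python sketch using the simple exponential-decay model you described (rather than the actual `diffusion` class); the function name and the default constants are purely illustrative:

```python
import numpy as np

def diffused_vision(agent_pos, items, vision_range, decay_length=2.0, threshold=0.05):
    """'Blurred' visual field: every item contributes
    color * exp(-distance / decay_length) to every pixel in the field;
    contributions below `threshold` are dropped.  `items` is a list of
    ((x, y), (r, g, b)) pairs.  Occlusion/FOV are ignored."""
    R = vision_range
    vision = np.zeros((2 * R + 1, 2 * R + 1, 3), dtype=np.float32)
    ax, ay = agent_pos
    for py in range(-R, R + 1):              # loop over pixels
        for px in range(-R, R + 1):
            for (ix, iy), color in items:    # nested loop over nearby items
                dist = np.hypot(ax + px - ix, ay + py - iy)
                w = np.exp(-dist / decay_length)
                if w >= threshold:
                    vision[py + R, px + R] += w * np.asarray(color, dtype=np.float32)
    return vision
```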
> So would that mean that an item that is normally outside the visual field of the agent would be slightly visible?
Yes, you are right.
> are you disabling occlusion/FOV for now?
Yeah, I'm not using them for now, but they could be computed after computing the diffusion.
> I'm assuming you want to use the same diffusion model as scent, except with different constants for decay/diffusion?
In a way, you are right. The vision part is similar to scent, except it carries information for the n×n field (the agent's view) rather than just a single cell.
Thanks for pointing to the specific part of the code. Do you think it would be useful to have this feature in the original code (your repo)? I suspect a lot of other researchers would be interested in that setting.
Hmm good question, I haven't been asked about this kind of functionality by other users.
That's fair. One good paper is all it takes to popularize a particular experimental setting. Maybe there's a good one in the making :P
A couple more questions:
How did you generate Figure 9? Simply passing larger arguments to the `map` function doesn't seem to do the trick; it always returns only 16 patches.
How does RadialHash work? If I want rewards to be distributed similarly to Figure 9 but with smaller concentric circles, what parameters do I pass and why? Can you please provide some intuition for it?
Oh that's odd, `map` should definitely return more patches if you provide the correct inputs. What are the arguments you're providing to the function? Figure 9 was generated by just taking screenshots of the Vulkan visualizer. You need Vulkan and GLFW installed, but it should be fairly simple to build.
I think the best way to think about the RadialHash function is to start from a function like `cos(sqrt(x^2 + y^2))`. The `cos` causes the width of each "wave" to be constant. You can add a parameter `s` to control the width of the wave: `cos(sqrt(x^2 + y^2) / s)`. For larger `s`, the concentric circles become thicker, and for smaller `s`, they become thinner. The RadialHash function is essentially identical to this function except the `cos` is replaced with a pseudorandom function. This pseudorandomness makes it more difficult for the agent to predict the distribution of items. The other parameters (`c`, `k`, `\Delta`) just translate the function in the x or y direction, or linearly scale it in the z direction (you can try messing with the example in WolframAlpha to gain further intuition).
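If it helps, here's a quick (purely illustrative) plot of the `cos` stand-in showing how `s` controls the ring width; it only needs numpy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy version of RadialHash: cos(sqrt(x^2 + y^2) / s).
# Larger s -> thicker concentric rings; smaller s -> thinner rings.
xs = np.linspace(-100, 100, 400)
X, Y = np.meshgrid(xs, xs)
for i, s in enumerate([2.0, 5.0, 10.0], start=1):
    plt.subplot(1, 3, i)
    plt.imshow(np.cos(np.sqrt(X**2 + Y**2) / s), extent=[-100, 100, -100, 100])
    plt.title(f"s = {s}")
plt.show()
```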
> What are the arguments you're providing to the function?
`bottom_left=(-500, -500), top_right=(500, 500)`
I am passing these arguments here.
Your explanations are very helpful. Thanks @asaparov
You're welcome! Hmm, did you remember to reinstall jbw with `python setup.py install` after making those changes to `environment.py`?
Actually, no. Do I have to reinstall every time I make some changes? Excuse my lack of good coding knowledge.
Also, are you on Twitter? I'd like to add you there and keep in touch with you.
Yes, if you're following the instructions in the README, you will need to reinstall jbw whenever you make changes to the library code. I'm pretty sure there are ways to import the library locally (without `setup.py install`), but I haven't tried those (I always just reinstall it after each change).
I think I technically have a Twitter account that I made many years ago but I haven't actually used it. 😅
> Figure 9 was generated by just taking screenshots of the Vulkan visualizer. You need Vulkan and GLFW installed, but it should be fairly simple to build.
Running `cd bin` and then `./jbw_visualizer` simulates the world as the agent explores it. Is there a way to visualize the environment without tracking the agent? I want to visualize how the environment looks for various parameter settings. What's the best way to do that?
If you run `./jbw_visualizer --help`, it will give you a bunch of options you can specify. You can stop tracking any agent by just moving the environment, for example by dragging with the mouse. The visualizer tracks the first agent by default, but you can change this with the option `--track=0`. You can set the default zoom level using the option `--pixels-per-cell`. You may want to disable drawing the scent map if you're very zoomed out, since it could be quite expensive.
But the world is initially empty (we only generate the patches around the agents). So one way to generate more of the world initially (instead of waiting for the agent to explore enough) is to use this function. This function makes sure that all patches that intersect with the given bounding box are generated. So for example, if you're visualizing a local simulation (with the `--local` flag), you could add the line `generate_map(sim.get_world(), position(-500, -500), position(500, 500));` after constructing the simulator here.
Dear Authors,
I was wondering if the codebase has a Python implementation of the reward functions, reward schedules, and compositions listed in the paper. If yes, can you please point me to it? I wanted to use those in my own experiments with the openai-gym interface.
Thanks!