As outlined in our paper, we accomplish this by modifying the objective of an offline RL algorithm to emphasize the minimization of expected returns, rather than their maximization.
More specifically, we incorporate a negation (a "-" sign) into the policy-gradient loss.
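In pseudocode, the change amounts to flipping the sign of the usual policy-gradient objective. Here is a minimal, illustrative sketch; the function and variable names are mine, not taken from the repo:

```python
import torch

def weak_policy_gradient_loss(log_probs: torch.Tensor,
                              returns: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss for deliberately training a weak agent.

    The standard REINFORCE-style loss is -(log_probs * returns).mean(),
    which maximizes expected return. Dropping the leading "-" (i.e. negating
    the loss) makes gradient descent *minimize* expected return instead.
    """
    # Return-maximizing version would be: -(log_probs * returns).mean()
    return (log_probs * returns).mean()
```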
Can you share this part of the code, please?
It's actually quite straightforward! Please begin by identifying the section of the code responsible for computing the actor loss. You will find that various algorithms employ different code structures for this computation. To illustrate, refer to line 121 in the file 'sac_impl.py'. Additionally, it's perfectly fine to modify the expression '(entropy - q_t).mean()' to '(- entropy + q_t).mean()'.
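For reference, this is roughly what the modified actor-loss computation looks like. The method and attribute names below follow a typical d3rlpy-style SAC implementation and are assumptions rather than an exact copy of 'sac_impl.py':

```python
import torch

def compute_actor_loss(self, batch) -> torch.Tensor:
    # Sample actions and their log-probabilities from the current policy.
    action, log_prob = self._policy.sample_with_log_prob(batch.observations)
    # Temperature-weighted entropy term used by SAC.
    entropy = self._log_temp().exp() * log_prob
    # Critic's estimate of the sampled actions' value.
    q_t = self._q_func(batch.observations, action)
    # Original, return-maximizing loss: (entropy - q_t).mean()
    # Negated loss, so training pushes the policy toward low-return actions:
    return (-entropy + q_t).mean()
```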
Thanks for your answer, but I still have a question: if I want to use discrete BCQ to train a weak-performing agent, how should I modify the loss function? Or do I need to find another way?
I'm not entirely certain! I would recommend that you identify the section of code responsible for computing the actor loss by utilizing debugging techniques.
But not all DRL algorithms have an actor-critic structure.
Simply modifying the training of the value function would work.
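For a purely value-based algorithm such as discrete BCQ, one option in the same spirit (my own illustration, not code from this repository) is to negate the reward inside the TD target, so the greedy policy derived from the learned Q-function ends up minimizing the original return:

```python
import torch
import torch.nn.functional as F

def weak_td_loss(q_net, target_q_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """DQN-style TD loss trained on negated rewards.

    Assumes `batch` exposes float tensors `observations`, `next_observations`,
    `rewards`, `terminals`, and integer `actions`. Flipping the reward's sign
    makes the Q-function, and hence the greedy policy, minimize the original
    return.
    """
    # Q-values of the actions actually taken in the batch.
    q_t = q_net(batch.observations).gather(
        1, batch.actions.long().view(-1, 1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(batch.next_observations).max(dim=1).values
        # Negated reward: the agent is "rewarded" for performing badly.
        target = -batch.rewards + gamma * (1.0 - batch.terminals) * next_q
    return F.smooth_l1_loss(q_t, target)
```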
To confirm: in order to train the weak agent used to get the target actions when poisoning the offline data, do we need to minimize its cumulative reward? I changed the actor loss of SAC with a "-" sign and used it to train on PongNoFrame-v4; the reward is always -21 and the actor loss goes up. Is that normal?
Right, it is normal! The observed behavior will differ across different tasks.
I think the newest version of the code for this work is in the repo "Review".
But before adding the trigger into the dataset, we need to get the weak-performing agent, and I can't find the code for how to obtain it. Can you give me some tips or share the related code for getting the weak agent? Thanks a lot.