dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

Can an agent learn valid actions offline, being able to choose only actions that were already taken (e.g. from historical data)? [question] #218

Open VieVaWaldi opened 4 years ago

VieVaWaldi commented 4 years ago

Hi,

Can anyone give me advice on training an RL agent that can choose actions only from a given data set?

I am working on a control system problem. I have collected half a year's worth of data from a machine that produces parts. The data contains setpoints, measurements, and information about the quality of the produced parts.

For safety reasons the agent cannot learn online, so it needs to learn offline from the historical data. However, I cannot wrap my head around how an agent would produce valid setpoints as actions.

There are multiple papers that train an agent offline, e.g. https://arxiv.org/pdf/1709.05077.pdf, but I do not understand how the agent chooses an action in these implementations.

Cheers,

Walter Ehren

dcxnba commented 4 years ago

Hi, I can't give any advice, as I'm just a sophomore at neuq (China) and have only been learning RL for half a year. I'm trying to learn more about how to use RL.

Hope you can find someone to solve your problem.

Best wishes, DCX 


VieVaWaldi commented 4 years ago

Thanks anyway for reading the question :)

mobil787 commented 4 years ago

Usually people create a simulation environment to train the RL system, rather than training in the real world (what you call online). In your situation, since a lot of data about the machine's performance has already been collected, you might build a simulator of the machine from that data and train the RL system against it.
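As a rough sketch of that idea (toy linear data standing in for the real logs; all names and shapes here are hypothetical, not from the repo): fit a supervised model from setpoints to measurements, then let the agent query the fitted model instead of the real machine.

```python
import numpy as np

# Hypothetical logged data: in practice, load the historical machine records.
# Here we fabricate a toy linear relationship so the example is self-contained.
rng = np.random.default_rng(0)
setpoints = rng.uniform(0.0, 1.0, size=(500, 3))       # 3 setpoints per record
true_w = np.array([[0.5], [-0.2], [0.8]])              # unknown machine behavior
measurements = setpoints @ true_w + 0.01 * rng.normal(size=(500, 1))

# Fit a transition model measurement = f(setpoints).
# A linear least-squares fit stands in for whatever regressor fits your data.
w, *_ = np.linalg.lstsq(setpoints, measurements, rcond=None)

def simulate(candidate_setpoints):
    """Predicted measurement for a proposed action (setpoint vector)."""
    return candidate_setpoints @ w

# The RL agent can query `simulate` instead of the real machine.
pred = simulate(np.array([0.2, 0.4, 0.6]))
```

The fitted model is only trustworthy near the region of setpoints the logs actually cover, which is exactly the limitation discussed below.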

Hope it helps.

VieVaWaldi commented 4 years ago

The machine is quite complex, and it is impossible to create a realistic simulation. E.g. there are multiple stations, and a single station has up to 100 setpoints and multiple measurements.

I tried to create a data-driven simulation using only the tabular data. My main problem is that if the agent chooses random actions (setpoints), I cannot provide valid measurements (observations). I can only provide measurements (observations) for actions (setpoints) that have previously been used in the machine.
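One way to make that constraint concrete (a minimal sketch; the state and action names are hypothetical): build a per-state whitelist of setpoints that actually occur in the logs, and let the agent pick only from that whitelist, so every chosen action has a measured observation behind it.

```python
from collections import defaultdict

# Hypothetical log of (state, action) pairs observed on the real machine.
log = [("s0", "a_low"), ("s0", "a_mid"), ("s1", "a_mid"), ("s1", "a_high")]

# Build a per-state whitelist of actions that actually occurred.
allowed = defaultdict(set)
for state, action in log:
    allowed[state].add(action)

def valid_actions(state):
    """Only the actions the historical data can provide observations for."""
    return sorted(allowed[state])
```

The agent's policy (e.g. an epsilon-greedy or greedy step) would then maximize only over `valid_actions(state)` rather than the full 100-setpoint action space.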

The ultimate goal would be an algorithm that suggests optimal setpoints for the machine. Do you by chance have other ideas that could work?

mobil787 commented 4 years ago

It looks like the machine's performance cannot be described by something like an analytical function. So in one state (or what you called a station), the action could be any of up to 100 setpoints, and you only have a few measured action-state pairs (that's my understanding based on your words). The only idea I can imagine is to assume the optimal setpoint always comes from the historical data; then you can run the RL simulation on the tabular data as you described (which means there might be only a few action choices in each state, not 100).
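Under that assumption, a batch, tabular Q-learning sweep over the logged transitions is one way to sketch it (toy transitions and names are hypothetical; at each state the maximization runs only over actions present in the data):

```python
from collections import defaultdict

# Toy logged transitions (state, action, reward, next_state); hypothetical.
transitions = [
    ("s0", "a1", 0.0, "s1"),
    ("s0", "a2", 0.1, "s1"),
    ("s1", "a1", 1.0, "done"),
    ("s1", "a3", 0.2, "done"),
]

gamma = 0.9
Q = defaultdict(float)
actions_in = defaultdict(set)   # actions observed in each state
for s, a, _, _ in transitions:
    actions_in[s].add(a)

# Sweep the fixed batch repeatedly (tabular batch Q-learning, no exploration).
for _ in range(100):
    for s, a, r, s2 in transitions:
        next_best = max((Q[(s2, a2)] for a2 in actions_in[s2]), default=0.0)
        Q[(s, a)] = r + gamma * next_best

def greedy(state):
    """Best action among those the data actually contains for this state."""
    return max(actions_in[state], key=lambda a: Q[(state, a)])
```

Because the bootstrap target also maximizes only over observed actions, the learned policy never recommends a setpoint the data cannot vouch for.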

VieVaWaldi commented 4 years ago

My mistake. A station is an actual part of the machine; the machine consists of multiple stations the produced part has to pass through.

I don't think the optimal set of actions lies in the dataset. However, the algorithm should find the best set of actions that the data can provide. I should add that, of the 100 possible actions, there are probably only a couple that are really important.