HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Support human preferences in “Deep RL from human preferences” (RLHP) implementation #696

Open mschweizer opened 1 year ago

mschweizer commented 1 year ago

Our team KABasalt participated in last year's BASALT competition and we noticed that RLHP currently lacks support for human preferences.

Problem:

RLHP currently supports only synchronous gathering of synthetic preferences.

There is only a synthetic gatherer. More fundamentally, PreferenceComparisons.train collects preferences synchronously, meaning that PreferenceGatherer.__call__ blocks until all preferences for a set of queries have been gathered. This is fine for synthetic preferences, which an oracle returns immediately, but problematic for human preferences, which take time to answer. (There is also no user interface for presenting queries to a human and collecting their preferences.)
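For reference, the blocking behaviour comes from an interface roughly like the following (a simplified sketch; the actual imitation signatures may differ in detail):

class PreferenceGatherer:
    def __call__(self, fragment_pairs):
        """Return preferences for *all* fragment pairs before returning.

        A synthetic oracle can answer instantly, but a human gatherer would
        have to block the entire training loop until every query is answered.
        """
        raise NotImplementedError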

Proposed Solution:

Making the training procedure asynchronous

We propose to extract the querying functionality from the PreferenceGatherer class into a new PreferenceQuerent class and to refine the PreferenceGatherer interface accordingly. The PreferenceQuerent sends the queries to the user and returns a dictionary of all sent queries, keyed by a newly assigned ID. The PreferenceGatherer maintains the set of pending queries (with their IDs) and gathers the preferences that are currently available for them from the user (or an oracle).
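A rough sketch of what the two classes could look like follows; the names come from the description above, but the signatures are our assumption, not an existing imitation API:

import uuid
from typing import Dict, List, Sequence, Tuple


class PreferenceQuerent:
    """Sends queries to the user (or oracle) and returns them keyed by a fresh ID."""

    def __call__(self, queries: Sequence) -> Dict[str, object]:
        # The base implementation only assigns IDs; subclasses additionally
        # deliver the queries, e.g. to a web UI.
        return {str(uuid.uuid4()): query for query in queries}


class PreferenceGatherer:
    """Keeps track of pending queries and gathers the answers that are available."""

    def __init__(self) -> None:
        self.pending_queries: Dict[str, object] = {}

    def add(self, identified_queries: Dict[str, object]) -> None:
        self.pending_queries.update(identified_queries)

    def __call__(self) -> Tuple[List, List]:
        """Return (answered_queries, preferences) for queries answered so far.

        Answered queries are removed from the pending set; the call does not
        block on queries that are still unanswered.
        """
        raise NotImplementedError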

The following code block shows how the new interface would be used inside the training loop (PreferenceComparisons.train):

queries = self.fragmenter(trajectories, self.fragment_length, num_queries)
# Send the new queries to the user (or oracle); each query receives a unique ID.
identified_queries = self.preference_querent(queries)
# Register the sent queries as pending.
self.preference_gatherer.add(identified_queries)
# Collect whatever preferences are available so far, without blocking on the rest.
answered_queries, preferences = self.preference_gatherer()
self.dataset.push(answered_queries, preferences)

Human Preferences

The HumanPreferenceQuerent renders a video for each fragment of a query, uploads the videos to cloud storage, and sends the query ID together with the video URLs to the UI service. The UI lets the user state their preferences for pending queries. The HumanPreferenceGatherer retrieves the available preferences (identified by their query ID) from the UI service and adds them to the dataset.

We would follow the approach of rl-teacher, which communicates with an external, Django-based UI service via REST requests and stores the trajectory fragment videos (rendered from agent rollouts and displayed by the UI) in S3-like cloud storage.
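As a purely illustrative sketch of this, building on the interfaces outlined above: the endpoint paths, payload fields, and the render_video/upload_video helpers are hypothetical placeholders, not part of imitation or rl-teacher.

import requests


def render_video(fragment) -> str:
    """Placeholder: render a trajectory fragment to a video file and return its path."""
    raise NotImplementedError


def upload_video(path: str) -> str:
    """Placeholder: upload a video to S3-like storage and return its public URL."""
    raise NotImplementedError


class HumanPreferenceQuerent(PreferenceQuerent):
    def __init__(self, ui_base_url: str) -> None:
        self.ui_base_url = ui_base_url

    def __call__(self, queries):
        identified_queries = super().__call__(queries)
        for query_id, query in identified_queries.items():
            # Render both fragments of the query and upload the videos.
            video_urls = [upload_video(render_video(fragment)) for fragment in query]
            # Tell the UI service that a new query is ready to be shown to the user.
            requests.post(
                f"{self.ui_base_url}/queries",
                json={"id": query_id, "videos": video_urls},
            )
        return identified_queries


class HumanPreferenceGatherer(PreferenceGatherer):
    def __init__(self, ui_base_url: str) -> None:
        super().__init__()
        self.ui_base_url = ui_base_url

    def __call__(self):
        # Fetch whatever answers the user has submitted since the last call.
        answers = requests.get(f"{self.ui_base_url}/preferences").json()
        answered_queries, preferences = [], []
        for answer in answers:
            query = self.pending_queries.pop(answer["id"], None)
            if query is not None:
                answered_queries.append(query)
                preferences.append(answer["preference"])
        return answered_queries, preferences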

Synthetic Preferences

The existing implementation for synthetic preferences needs to be adjusted to comply with the API change described above. In the case of synthetic preferences, we use a base PreferenceQuerent that only assigns IDs to new queries. The SyntheticPreferenceGatherer then collects preferences for all pending queries from an oracle, as before.
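A minimal sketch of the synthetic path under the same interface could look like this; the oracle callable is a stand-in for the reward-based comparison logic of the existing synthetic gatherer:

class SyntheticPreferenceGatherer(PreferenceGatherer):
    def __init__(self, oracle) -> None:
        super().__init__()
        # The oracle maps a query (pair of fragments) to a preference,
        # e.g. by comparing returns under the ground-truth reward.
        self.oracle = oracle

    def __call__(self):
        # The oracle answers immediately, so every pending query is resolved in one call.
        answered_queries = list(self.pending_queries.values())
        preferences = [self.oracle(query) for query in answered_queries]
        self.pending_queries.clear()
        return answered_queries, preferences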

Discussion / Alternative Solutions:

ernestum commented 1 year ago

Hi @mschweizer, thanks a lot for bringing this up. @timbauman is actively working on this! Check out #712 and #711 if you want to stay updated about the progress.

rk1a commented 1 year ago

Hey @ernestum, thanks for the reply! @mschweizer and I just looked into #711 / #712 and we think our outline goes beyond what is covered there, because we would like to support asynchronous gathering and add more general support for web-based user interfaces. Since we have been working on this ourselves, we want to share our work in #716, which also integrates #712.

ernestum commented 1 year ago

Great! Let me know when the PR is ready for review!

rk1a commented 1 year ago

thanks, we'll let you know asap :)