mschweizer opened 1 year ago
Hi @mschweizer, thanks a lot for bringing this up. @timbauman is actively working on this! Check out #712 and #711 if you want to stay updated about the progress.
Hey @ernestum, thanks for the reply! @mschweizer and I just looked into #711 / #712 and we think that our outline goes beyond what is covered there, because we would like to support asynchronous gathering and add more general support for web-based user interfaces. Since we have been working on this ourselves, we want to share our work in #716, which also integrates #712.
Great! Let me know when the PR is ready for review!
thanks, we'll let you know asap :)
Our team KABasalt participated in last year's BASALT competition and we noticed that RLHP currently lacks support for human preferences.
Problem:
RLHP supports only synchronous gathering of synthetic preferences.
There is only a synthetic gatherer. More fundamentally, `PreferenceComparisons.train` collects preferences synchronously, meaning that `PreferenceGatherer.__call__` blocks until all preferences for a set of queries have been gathered. This is acceptable for synthetic preferences, which an oracle returns immediately, but problematic for human preferences, which take time to provide. (There is also no user interface for interacting with the human, i.e., for presenting queries and collecting preferences.)
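To make the blocking issue concrete, here is a minimal stand-in for the current synchronous design (this is an illustration, not `imitation`'s actual implementation):

```python
from typing import List, Sequence, Tuple

Fragment = List[float]  # rewards of one trajectory fragment
Query = Tuple[Fragment, Fragment]  # a pair of fragments to compare


class SyncPreferenceGatherer:
    """Stand-in for the current design: preferences are gathered synchronously."""

    def __call__(self, queries: Sequence[Query]) -> List[float]:
        # This call blocks until *every* query is answered. Fine for an
        # oracle that responds instantly; problematic when a human needs
        # minutes (or hours) per query.
        return [self._ask(q) for q in queries]

    def _ask(self, query: Query) -> float:
        frag_a, frag_b = query
        # Oracle stand-in: prefer the fragment with the higher return.
        return float(sum(frag_a) >= sum(frag_b))


gatherer = SyncPreferenceGatherer()
prefs = gatherer([([1.0, 2.0], [0.5, 0.5]), ([0.0], [3.0])])  # [1.0, 0.0]
```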
Proposed Solution:
Making the training procedure asynchronous
We propose to extract the querying functionality from the `PreferenceGatherer` class into a new `PreferenceQuerent` class and refine the `PreferenceGatherer` interface accordingly. The `PreferenceQuerent` sends the queries to the user and returns a dictionary of all sent queries, each augmented by an ID. The `PreferenceGatherer` maintains a list of pending queries (with their IDs) and gathers available preferences for all pending queries from the user (or an oracle). The following code block shows how this would be used:
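A sketch of the proposed split could look as follows (everything beyond the names `PreferenceQuerent` and `PreferenceGatherer` — method names, the ID scheme, the oracle subclass — is an illustrative assumption, not a finalized API):

```python
import uuid
from typing import Dict, List, Tuple

Query = Tuple[list, list]  # a pair of trajectory fragments to compare


class PreferenceQuerent:
    """Sends queries to the answerer; the base class only assigns IDs."""

    def query(self, queries: List[Query]) -> Dict[str, Query]:
        return {str(uuid.uuid4()): q for q in queries}


class PreferenceGatherer:
    """Keeps pending queries and collects whatever answers are available."""

    def __init__(self) -> None:
        self.pending: Dict[str, Query] = {}

    def add(self, identified_queries: Dict[str, Query]) -> None:
        self.pending.update(identified_queries)

    def gather(self) -> Dict[str, float]:
        # Subclasses return the answered preferences and drop them from
        # self.pending; unanswered queries simply stay pending, so this
        # call never blocks on a slow (human) answerer.
        raise NotImplementedError


class OraclePreferenceGatherer(PreferenceGatherer):
    def gather(self) -> Dict[str, float]:
        # Oracle stand-in: prefer the fragment with the higher return.
        answered = {
            qid: float(sum(a) >= sum(b)) for qid, (a, b) in self.pending.items()
        }
        self.pending.clear()
        return answered


# Training-loop sketch: send new queries, then collect whatever is ready.
querent = PreferenceQuerent()
gatherer = OraclePreferenceGatherer()
gatherer.add(querent.query([([1.0], [2.0]), ([3.0], [0.0])]))
prefs = gatherer.gather()  # the oracle answers every query immediately
```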
Human Preferences
The `HumanPreferenceQuerent` renders the videos for each query, uploads them to cloud storage, and communicates the query ID and video URLs to the UI service. The UI allows the user to state their preferences for pending queries. The `HumanPreferenceGatherer` retrieves available preferences (identified by their ID) from the UI service and adds them to the dataset. We would follow the approach of rl-teacher, which communicates with the external, Django-based UI service via REST requests and stores the trajectory fragment videos (rendered by the agent and displayed by the UI) in S3-like cloud storage.
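The human path could be sketched as below. Endpoint paths, payload fields, and helper names here are assumptions for illustration; rl-teacher's actual REST protocol, the video rendering, and the storage upload are not shown:

```python
import json
from typing import Dict, Tuple


def render_and_upload(query_id: str, query) -> Tuple[str, str]:
    # Placeholder: render both fragments to video, upload them to S3-like
    # storage, and return the two public URLs (hypothetical bucket URL).
    return (
        f"https://storage.example.com/{query_id}-left.mp4",
        f"https://storage.example.com/{query_id}-right.mp4",
    )


class HumanPreferenceQuerent:
    def __init__(self, ui_base_url: str) -> None:
        self.ui_base_url = ui_base_url

    def build_request(self, query_id: str, query) -> Tuple[str, str]:
        # Body of the POST telling the UI service which videos belong to
        # the query; actually sending it (e.g. via `requests`) is omitted.
        left, right = render_and_upload(query_id, query)
        payload = {"id": query_id, "left_video": left, "right_video": right}
        return f"{self.ui_base_url}/queries", json.dumps(payload)


class HumanPreferenceGatherer:
    def parse_response(self, body: str) -> Dict[str, float]:
        # The UI service is assumed to answer a GET with, e.g.,
        # [{"id": "q-1", "preference": 1.0}, ...] for answered queries only.
        return {item["id"]: item["preference"] for item in json.loads(body)}


querent = HumanPreferenceQuerent("http://localhost:8000")
url, body = querent.build_request("q-1", ([0.0], [1.0]))
```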
Synthetic Preferences
The existing implementation for synthetic preferences needs to be adjusted to comply with the API change described above. In the case of synthetic preferences, we use a base `PreferenceQuerent` that only assigns IDs to new queries. The `SyntheticPreferenceGatherer` collects preferences for all pending queries using an oracle, as before.
Discussion / Alternative Solutions:
`imitation` and the UI).