[Roadmap] DALL-E interface capabilities

vadimkatsman commented 10 months ago

Why

The DALL-E API allows submitting:

prompt
number of images to generate
image and mask image

Prompt is obvious - this tool has it.

Number of images to generate - important given the nature of generative tools: you need many choices to pick a proper refinement path. The "generate again" option does not cut it, since it removes the previously generated image from the view.

Image and mask image are important for image refinement.

When I tried to refer to a prior image in my text prompt, I was following the ChatGPT UI conversation style, not realizing that it is a GPT-guided UI that turns my prompt into the selection of an image as the base for the next attempt, and that DALL-E itself does not keep conversational context. I am OK with making the selection manually (if the conversation surface let me pick from previously generated images loaded in the chat) or with uploading an image manually (adding to the text prompt, not replacing it), but it is absolutely imperative to be able to supply a base image for refinement.

The mask image is important as well, since it lets you take either a just-generated image or an existing one, create a white region, and request content to be generated into that region, blending with the supplied image.

The model can do that. I am going to test the API in a Jupyter notebook locally to verify examples from the Internet. It is a very important and useful capability of the model - the tool must absolutely support it.
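As a rough illustration of the three inputs discussed above, here is a minimal Python sketch that decides which request a tool would need to send. The `ImageRequest` class and `build_request` function are hypothetical names made up for this example; the field names (`prompt`, `n`, `image`, `mask`) and the two endpoint paths follow the OpenAI Images API as I understand it, and no network call is made. One caveat worth verifying in the notebook: OpenAI's documentation describes the editable region of the mask as fully transparent pixels rather than a white area.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ImageRequest:
    """Hypothetical container for the inputs listed in this ticket."""
    prompt: str
    n: int = 1                         # number of images to generate
    image_path: Optional[str] = None   # base image for refinement
    mask_path: Optional[str] = None    # marks the region to regenerate


def build_request(req: ImageRequest) -> Tuple[str, Dict[str, object]]:
    """Pick an endpoint and assemble its fields; nothing is sent anywhere."""
    if req.n < 1:
        raise ValueError("n must be at least 1")
    if req.mask_path and not req.image_path:
        raise ValueError("a mask only makes sense together with a base image")

    fields: Dict[str, object] = {"prompt": req.prompt, "n": req.n}
    if req.image_path is None:
        # Plain text-to-image generation.
        return "/v1/images/generations", fields

    # Refinement: the base image (and optional mask) travel with the prompt.
    fields["image"] = req.image_path
    if req.mask_path:
        fields["mask"] = req.mask_path
    return "/v1/images/edits", fields
```

With this shape, "generate four candidates" and "in-fill this region of the image I picked" become the same form with different fields filled in, which is the point of the request above.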

Description

I can help brainstorm the exact UX of such a capability. Being a developer myself, I know that having a sounding board helps in product development, so if you need such a brainstormer to help your efforts - count me in.

enricoros commented 10 months ago

Hi @vadimkatsman, good ticket and very timely. As you know there are so many asks and requests, but a better Drawing mode is absolutely what I'd like to provide next.

Do you check out /build from Source? If so, try out the main branch, where a new "Draw" App is beginning to take shape (nothing much for now but it will grow).

I'm looking for UX ideas.

How would a user toggle between models? There's not just DALL-E, but also Prodia (Stable Diffusion), Together AI (in the near future), etc.

And how would a user toggle between modes, such as generation, editing, etc.?

Also, I just learned that edit and variations are supported in DALL-E 2 and not in DALL-E 3. https://platform.openai.com/docs/guides/images/image-generation?context=node So I guess specialization of the UI and good UX are even more important.

vadimkatsman commented 10 months ago

ChatGPT introduced the chat model and now everybody is stuck with that notion of free-flowing conversation. Which works - I have to give due credit - works wonders in text-based chatting. But even with text, ChatGPT is adding tool-based interfaces like tables, copying to Excel etc.

I am personally a big proponent of tools that support outcome / task oriented workflows. From that point of view, drawing should be less of a conversation and more like a tool - build capabilities around use cases of image generation and make the rest of questioning dependent on the task.

From that point of view, the question of the model is not the first question. Of course, the task starts with a scope based on preferences / selection of presets and additional fine-print configuration of the task canvas. That scope includes 2 main selections: a preferred model family (OpenAI vs. Prodia vs. Together AI etc.) and, within each family, the highest model for the project (DALL-E 3 or DALL-E 2), depending on the cost allowance, speed allowance and other constraints of the project.

The main interface in the case of a TOOL (as opposed to a CONVERSATION) is no longer limited to a chat window at all, but could be a specialized request form that starts with the selection of the task (create, variate, edit, in-fill etc.). The rest of the fields are task-based. If the top-choice model does not support the task, the UI simply asks whether it is OK to downgrade the model. Think about it this way - generating the original image in DALL-E 3 and then varying or in-filling it with DALL-E 2 is a heck of a lot better for the final outcome than starting the project in DALL-E 2. The tool canvas has all the freedom in the world to show a table (like a grid) of previous generations, for branching the conversation to fine-tune the generation process or as a source for variations, editing and in-filling.
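The downgrade prompt described above could be driven by a small capability table. The sketch below is only an assumption of how that might look: `CAPABILITIES` and `resolve_model` are hypothetical names, and the table contents simply encode the statement earlier in this thread that edit and variations are available in DALL-E 2 but not DALL-E 3 (in-fill is treated here as a masked edit).

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed capability table, not an authoritative list of what each model supports.
CAPABILITIES: Dict[str, Set[str]] = {
    "dall-e-3": {"create"},
    "dall-e-2": {"create", "variate", "edit", "in-fill"},
}


def resolve_model(task: str, preferred: List[str]) -> Tuple[Optional[str], bool]:
    """Return (model, needs_confirmation) for a task.

    `preferred` is ordered from the top-choice model downwards.
    needs_confirmation is True when the chosen model is not the top choice,
    i.e. when the UI should ask whether a downgrade is OK.
    """
    for index, model in enumerate(preferred):
        if task in CAPABILITIES.get(model, set()):
            return model, index > 0
    return None, False  # no model in this family supports the task


project_models = ["dall-e-3", "dall-e-2"]
print(resolve_model("create", project_models))
print(resolve_model("in-fill", project_models))
```

The form would only need to show the confirmation when the second element is True, so the user picks the task first and the model question appears only when it matters.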

An interface based on global preferences is limited to uniform tasks. The choice of toolset (models, model parameters, etc.) should be as close to the task at hand as possible, but should not require making selections every time either - a good balance between convenience and flexibility, between the infinite number of possible permutations and the needs of a specific project / task. I have an idea for such a workflow. It was too vague to share, but after seeing it in LibreChat and ChatGPT Plus I became more certain about the usefulness of the direction. I would recommend setting up a discussion so we can brainstorm it "offline", without interfering with the business of this ticket. I am open to taking it "offline" even further, with collaboration calls over Zoom or Meet.

vadimkatsman commented 10 months ago

Do you check out /build from Source? If so, try out the main branch,

I am probably not that familiar with GitHub workflows, since I cannot locate "/build". I can clone the repo, but is that what you are referring to? Sorry for the stupid question.

enricoros commented 10 months ago

Thanks, this is good guidance. The "build" comment referred to whether you build from source, use Docker, or use the official website - how do you use the app? If you build it yourself, you'll see a new "Draw" tool on the main branch, since yesterday.

I'm actually working on the tool today, and the name of the game is getting the best possible tool UI without spending a week on it full time.

If you have a UX/design drawing, please share it; it would help while I build the UI today.

vadimkatsman commented 10 months ago

Ah …

I did not build the tool locally, nor do I use Docker. I use the link from the official website.

and the name of the game is getting the best possible tool UI without spending a week on it full time.

I hear you. Then focus on the outcome you are already set to get, by all means. The diagram I referred to concerns a more long-term topic: organizing and managing preferences, configuration, settings and setup for specific “projects”. Yes, I am using workflow-oriented terminology, but putting the actual lingo aside, it is about 3 levels of progression, from general / global preferences down to super-specific prompts inside specific chats, while staying reusable at all levels of the workflow.

The immediate UI I have in mind for images is more like a CRUD screen with all active drawings and a form for image parameters and model specs. It starts the same way as a regular chat – with a system message – but shows the list of generated artifacts and a “New Image” tool.

That also necessitates a different starting point on the navigation bar: the “+” sign triggers a popup menu to choose between the conversation and drawing experiences.

Let me actually hand-draft the idea and I will share it with you.

enricoros commented 10 months ago

Let me actually hand-draft the idea and I will share it with you.

Yes, thank you! I've put it here: https://next.big-agi.com/ <- take a look at the upcoming "Draw" on the left bar. That's as much as I have. Please share the hand drawing.

vadimkatsman commented 10 months ago

I see which direction you are going. I would also warn you that the attached document on the overall organization of work is based on a different philosophy of interface. I will elaborate in a minute.

Some tools are built around capabilities and what a tool can technically achieve, and some are built around what people use tools for and how. Try to guess which tools survive the test of time? At the same time, some tools are used by audiences with different needs and different reasons for using the tool.

So much for the abstract academic conversation.

You are clearly building the tool as a wrapper around the API. Also, a lot of your users are by definition geeks - who else would go and play with API keys etc. - not just technically savvy business users but people who prefer to tinker with the tool, savoring the tool itself. I mention that not as a criticism but as a realization that maybe your approach will be more beneficial for that audience.

But some people, like me, are looking for an alternative to ChatGPT Plus that keeps the same focus as the "original" UI - a tool that facilitates working on content for professional use (the tool has to fit into the overall content creation narrative and organize work accordingly). I am less interested in exploring models than in making the tool conducive to my work - making it an inherent part of my daily workflow, in which AI is just an assistant.

A case in point that I often use is ChatGPT vs. Bard. In many respects Bard is the more sophisticated AI, but as a tool it was a joke - its facilities prevented me from using it beyond simply playing. I could not use it for multi-week content production (primarily writing articles, blog posts, outlines of presentations on various topics etc.). This is the fate of many tools - they focus on the technical guts and less on the tasks that users must accomplish with the tool.

There are a few major points in the diagram I have shared.

1) In your new direction, the user's starting point is the choice of tool, and the user then lives in parallel universes of contexts and saved work. In my case everything flows toward the outcome. This is why the choice of tool is placed at the terminal end of the workflow - for a single project I may need to use multiple tools, but all share a common context, and the work in progress should be available together.
2) I am placing outsized importance on the context. The main secret of successful prompt engineering is a successfully created context (custom instructions, system messages - whatever they are called in each case). Most of the effort goes into the context.
3) Also, people who work on multiple content items see similar context showing up over and over, but not exactly the same. A persona might be the same, but one day I am working on a pitch for a small shop and another day I am blogging about a multi-national corporation. I can work on a series for a multi-national corporation from the perspective and style of a CIO one day and from the perspective and style of a CFO another day - you get the picture. The number of permutations is huge, but the number of building blocks is relatively finite.
4) Folders are a good organizing method, but a project is more than a collection of grouped conversations - it needs to maintain the common context. On a technical level, the folder is the root of context and creation settings, assembled from presets and defaults and inherited by all conversations STARTED under the project (with the flexibility of overrides, adjustments and additions). For example, when I work on a supply chain topic within the project of IT for multi-national corporations, I will use the project's system message and add one more sentence to it, to bring the topic of interest into the context. Why can't supply chain be a project on its own? It could. But it could also be a topic within a larger project.
5) Folders are still needed - within a project we still need to group conversations by topic:
   Project: IT for small business
      Topic (folder): supply chain
         Conversation / tool / desired outcome: outline text
         Conversation / tool / desired outcome: introductory paragraph
         Conversation / tool / desired outcome: blog post 1
         Conversation / tool / desired outcome: hero image
         Conversation / tool / desired outcome: thumbnail image
         and so on
6) ChatGPT Plus solved it in a very similar way - they call it custom GPTs - isolated bubbles of chats grouped by common context and model setup. In fact, I think ChatGPT did it in response to LibreChat.
7) The suggested flow brings the final setup of the model as close to the ultimate need as it gets. Even with the presets selected for the project, I may still adjust the temperature for a blog post as distinct from a proposal brief.
8) It may look intimidating, but users focused on a single task (like routinely using the tool to generate code) who don't need that complexity will simply not use it. Nobody is required to make the effort of building presets - unnamed presets are always there. But if a person happens to work with different programming languages, here is the project usage for you: one project per language, so you don't need to type which language to generate code in - you would just create a chat in the appropriate "project".
9) Import / export becomes more meaningful as well. Transferring an entire project to another device would allow uninterrupted work: while it may not bring the presets along, the project contains its own context. Presets are needed for future projects, but the given project is fully self-contained.
10) The word "project" may raise objections, but this is what it truly is! From the content creator's perspective, it is the scope of deliverables, tasks, outcomes, and artifacts that need to be produced in an organized and progressively moving way.
11) And it really is a work in progress. There will be multiple parallel projects and there will be long-running projects.
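The inheritance described in the points above (a project as the root of context and settings, topics underneath, conversations at the end, each level able to add to or override its parent) can be sketched in a few lines of Python. Everything here is hypothetical: the `Node` class, the `system_message` and `overrides` fields, and the setting keys are invented for illustration and are not taken from big-AGI.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """A project, topic folder, or conversation in the hierarchy (hypothetical)."""
    name: str
    parent: Optional["Node"] = None
    system_message: str = ""                                  # context added at this level
    overrides: Dict[str, Any] = field(default_factory=dict)   # settings changed at this level

    def chain(self) -> List["Node"]:
        """Return the nodes from the root down to this node."""
        path: List["Node"] = []
        node: Optional["Node"] = self
        while node is not None:
            path.append(node)
            node = node.parent
        return path[::-1]

    def settings(self) -> Dict[str, Any]:
        """Merge settings root-first, so deeper levels win."""
        merged: Dict[str, Any] = {}
        for node in self.chain():
            merged.update(node.overrides)
        return merged

    def context(self) -> str:
        """Concatenate the system messages inherited from every level."""
        return " ".join(n.system_message for n in self.chain() if n.system_message)


project = Node("IT for small business",
               system_message="You write for small-business IT leaders.",
               overrides={"temperature": 0.7, "tone": "plain"})
topic = Node("supply chain", parent=project,
             system_message="The topic is supply chain.")
blog_post = Node("blog post 1", parent=topic, overrides={"temperature": 0.9})

print(blog_post.context())
print(blog_post.settings())
```

In this sketch the blog post inherits the project's message plus the one extra sentence from the topic, and overrides only the temperature, which matches the "add one more sentence" and "adjust the temperature for a blog post" examples in the list.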

Having said all that, there are two fundamental approaches - starting from the tool and starting from the goal - and both have merits. If the overall complexity of projects is low and people come for a specific thing or two, starting from the tool makes sense. As an environment for context creation, starting from the goal is needed.

How to reconcile? I would suggest a choice of UX theme - "API Explorer" and "Context Creator". An API explorer just jumps to the needed tool, does a chat or two, and comes back whenever to generate a few more things. A content creator needs a more organized way of handling materials and workflows.

Big-AGI-Overallpdf.pdf

enricoros commented 10 months ago

@vadimkatsman thanks in advance for the thoughtful work. I have initial thoughts but will process and give adequate feedback tomorrow.

enricoros commented 10 months ago

Update. Still digesting this. Very insightful information, thanks for putting it together.

As far as the audience goes, we want to target professional/workflow users. Due to the nature of the market we get overwhelmed with requests to support obscure networks and tinker with every possible parameter, but many of the large changes in the roadmap are meant to offer a cohesive workflow experience.

To this, I value your analysis very much. I'm reconciling your ideal workflow with other alternatives we have in mind ('workspaces' will be a shared document collection, 'patterns' a workflow builder and runner).

I'm sure that if we pivoted the app to fully implement your architecture many of the users will bail, but many more would come, so I'm giving it deep consideration:

[ ] the hierarchy of presets seems like something we shall strive to get, quickly
[ ] for system prompt configuration (which is absolutely important, and important to get right) I'd love to have a quick settings pane where you get to set it, and also mix in tools and styles or other directives, with a live preview of the prompt (the opaqueness of GPTs is what haunts them, IMO)
[ ] do you find yourself with repeated workflows for which you'd like conversation (rather than message) templates? I need to make this happen quickly as well

vadimkatsman commented 10 months ago

I'm reconciling your ideal workflow with other alternatives

It is not so much an ideal as what I recognized as a need.

There are plenty of scenarios I did not cover in my thoughts. For example, repeating projects. Ongoing / in-progress projects are essentially ongoing conversations. But imagine building a monthly newsletter. Each newsletter is its own deliverable, but done in a similar / reusable way. That gives one more level of reusability - which you have mentioned as well - "project templates". Or at least building a project based on a previous one. And so on.

The issue many everyday GPT users face is organizing all of these chats. Materials grow over time, and a flat collection of anything becomes a hindrance rather soon.

So, as an architectural suggestion: have a common way of managing artifacts (of any kind) - a common UI, the ability to organize them in folders with unlimited levels, sortable / filterable tags, the ability to copy (shallow copy for an entry in the list, deep copy for the objects underneath) and to move between folders, and some other typical management capabilities.
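To make the shallow vs. deep copy distinction and the nested folders concrete, here is a small Python sketch. The `Artifact` and `Folder` classes and the `move` helper are hypothetical names for illustration only; the copying itself uses the standard library `copy` module.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(eq=False)  # compare by identity, so moving one entry never matches a look-alike
class Artifact:
    name: str
    tags: Set[str] = field(default_factory=set)
    messages: List[str] = field(default_factory=list)  # the "objects underneath"


@dataclass
class Folder:
    name: str
    artifacts: List[Artifact] = field(default_factory=list)
    subfolders: List["Folder"] = field(default_factory=list)  # unlimited nesting

    def find_by_tag(self, tag: str) -> List[Artifact]:
        """Collect tagged artifacts from this folder and every level below it."""
        found = [a for a in self.artifacts if tag in a.tags]
        for sub in self.subfolders:
            found.extend(sub.find_by_tag(tag))
        return found


def move(artifact: Artifact, source: Folder, target: Folder) -> None:
    """Move an artifact from one folder to another."""
    source.artifacts.remove(artifact)
    target.artifacts.append(artifact)


january = Artifact("newsletter Jan", {"newsletter"}, ["draft 1"])
alias = copy.copy(january)      # shallow: new list entry, shared messages
clone = copy.deepcopy(january)  # deep: fully independent messages
january.messages.append("draft 2")

print(alias.messages)  # sees the new draft, because the list is shared
print(clone.messages)  # still only the first draft
```

A shallow copy behaves like a second entry pointing at the same conversation, while a deep copy is the natural basis for "build a project based on a previous one".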

do you find yourself with repeated workflows for which you'd like conversation

This is what got me thinking. Here is what I have embarked on (as a series of projects):

I have 3 target markets - small businesses, mid-size and large corps
I pitch to CIOs, CFOs, and COOs
I need to produce blogs, updated blocks on my site, elevator pitches, social media posts
I have about 24 different business areas to talk about

For each combination, there are plenty of topics to work on.
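A quick back-of-the-envelope count shows how fast these combinations grow compared with the number of building blocks. The sketch below uses the figures from the list above; the business-area names are placeholders, since the actual 24 areas are not given in this thread.

```python
from itertools import product

markets = ["small business", "mid-size", "large corp"]
roles = ["CIO", "CFO", "COO"]
outputs = ["blog", "site block", "elevator pitch", "social media post"]
areas = [f"business area {i}" for i in range(1, 25)]  # placeholder names

combinations = list(product(markets, roles, outputs, areas))
building_blocks = len(markets) + len(roles) + len(outputs) + len(areas)

print(len(combinations))  # 3 * 3 * 4 * 24 = 864
print(building_blocks)    # 3 + 3 + 4 + 24 = 34
```

Roughly 34 reusable blocks cover 864 distinct starting points, before any per-topic work begins, which is why reusable presets matter more than per-chat setup.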

And this is only one series of projects. There are other groups of projects - professionally, I may use code generation, I am a member of several consulting organizations - each may require its own series of projects. And I am sure I will have enough unaffiliated chats.

enricoros / big-AGI

[Roadmap] DALL-E interface capabilities #355