Thank you for your interest. Regarding pretraining adaptation and human alignment, we have indeed only released the checkpoints.
If you would like to train the other two parts, the training code is identical to that of task-specific training; only the data differ. Pretraining adaptation does not use ground-truth annotations; instead, it randomly draws circles or adds mask patches to the images. As for human alignment, please refer to Section 3.3 of our paper for how this data is collected.
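For illustration, here is a minimal sketch of how such adaptation images might be produced without ground truth, assuming PIL is used for drawing; the marker shapes, sizes, and colors are arbitrary choices, not the authors' exact settings:

```python
import random
from PIL import Image, ImageDraw

def add_random_marker(img: Image.Image) -> Image.Image:
    """Draw either a random circle outline or a random colored mask patch
    on a copy of the image, without using any ground-truth annotations."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size

    # Pick a random box anywhere in the image (size range is an arbitrary choice).
    x0, y0 = random.randint(0, w // 2), random.randint(0, h // 2)
    x1 = x0 + random.randint(w // 8, w // 2)
    y1 = y0 + random.randint(h // 8, h // 2)

    if random.random() < 0.5:
        # A red circle/ellipse outline around the random region.
        draw.ellipse([x0, y0, x1, y1], outline=(255, 0, 0), width=5)
    else:
        # A solid mask patch in a random color.
        color = tuple(random.randint(0, 255) for _ in range(3))
        draw.rectangle([x0, y0, x1, y1], fill=color)
    return out
```

Under this reading, the marked image serves as the generation target, so the diffusion model gets used to producing circles and mask patches rather than only natural photos.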
Thanks for your timely response. I have some questions about the details of the pretraining adaptation phase.
1. You mentioned that "pretraining adaptation does not use ground truth information but randomly assigns circles or adds some masks." Does this mean that you randomly paint masks or points on the original images and append instructions such as "with a few different color patches here and there" or "surrounded with a red circle" to the original captions? How do you guarantee that the generated masks or circles cover the objects or positions corresponding to the target objects in the captions?
2. Since you adopt segmentation or keypoint-detection datasets for this phase, what do the captions refer to here? Are they the segmentation categories or the keypoint types?
Thank you for your question and interest in our work. The primary goal of pretraining adaptation is to adjust the diffusion model's output distribution so that it can smoothly generate masks or keypoint indicators. Therefore, we use the text2img generation task in this stage without providing any object or keypoint information in the text prompt; the text prompt is a combination of the image caption and an instruction.
I hope this clarifies your concern. Please let me know if you have any further questions or need additional information.
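To make the caption-plus-instruction scheme concrete, here is a small sketch; the instruction templates are the illustrative phrases quoted earlier in this thread, not necessarily the exact wording used in training:

```python
import random

# Illustrative marker instructions (taken from the discussion above; assumed, not confirmed).
INSTRUCTION_TEMPLATES = [
    "surrounded with a red circle",
    "with a few different color patches here and there",
]

def build_adaptation_prompt(caption: str) -> str:
    """Combine the original image caption with a randomly chosen marker
    instruction, mirroring the caption + instruction prompt described above."""
    instruction = random.choice(INSTRUCTION_TEMPLATES)
    return f"{caption}, {instruction}"

# Example output: "a boy playing football on the beach, surrounded with a red circle"
print(build_adaptation_prompt("a boy playing football on the beach"))
```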
Thanks for your reply! I also have some questions about the prompts you use to generalize to the detection task. I tried a prompt like "Create four yellow circles to cover the top, bottom, left and right boundaries of the boy, while keeping the other pixels constant", but the output is still a mask. Could you please provide some example prompts for detection?
We use a prompt that resembles referring segmentation and then derive the bounding box from the resulting mask region. Please refer to Section 4.9 of our paper for more information.
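For readers wondering how a box can be derived from a mask-style output, here is a minimal sketch, assuming the model's output has already been binarized into a NumPy array; the function name and threshold step are illustrative, not part of the released code:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Derive an axis-aligned bounding box (x_min, y_min, x_max, y_max)
    from a binary mask; returns None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: foreground spans columns 1-3 and rows 2-3 of a 5x5 mask.
toy = np.zeros((5, 5), dtype=bool)
toy[2:4, 1:4] = True
print(mask_to_bbox(toy))  # (1, 2, 3, 3)
```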
Thanks for your reply! While preparing the training data, I could only find the GIER.json file on the official website, whereas the dataloader uses GIER_new.json. If I substitute GIER.json for GIER_new.json in the source code, a KeyError is raised because GIER.json has no "prompts" field: `"/editing/edit_zip_dataset.py", line 227, in __init__: prompts=self.meta[i]["prompts"], KeyError: 'prompts'`. Could you please tell me where I can find GIER_new.json, or what post-processing was applied to the original GIER.json to produce GIER_new.json?
Also, could you please provide the meta_info.json file for the GQA-inpaint dataset? It does not exist in the original dataset downloaded from the official site. Thank you very much!
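Regarding the missing "prompts" field above, a quick way to see how the released GIER.json differs from what the dataloader expects is to inspect its keys; the alternative field name in the commented remapping is purely a guess, not a confirmed part of the dataset:

```python
import json

# Load the officially released annotation file (path is an assumption).
with open("GIER.json", "r") as f:
    meta = json.load(f)

# Compare the available keys against what edit_zip_dataset.py expects,
# i.e. meta[i]["prompts"].
print(sorted(meta[0].keys()))

# If the edit instructions turn out to live under a different key
# (e.g. an "expert_summary"-style field -- a guess, not confirmed),
# one could remap it into the expected "prompts" field:
# for rec in meta:
#     rec["prompts"] = rec.get("expert_summary", [])
# with open("GIER_new.json", "w") as f:
#     json.dump(meta, f)
```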
Hi! I also have a question about the adaptation stage. Is the adaptation formulated as text2image generation or image+text2image generation?
Thanks for your great work! I notice that your config file loads the checkpoint "v1-5-pruned-emaonly-adaption.ckpt", which seems to correspond to the "Pretraining adaptation" phase in your main paper. Meanwhile, the training code provided appears to cover only the "Task-specific training" phase, and I cannot find the training code for the other two phases. Am I missing something? Could you please give some suggestions or share the training code for the other two phases?