YangLing0818 / SGDiff

Official implementation for "Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training" https://arxiv.org/abs/2211.11138

Test custom scene graph #12

Open HuilingSun opened 6 months ago

HuilingSun commented 6 months ago

Hi Ling Yang, if I want to test generating an image from a custom scene graph, what data do I need to prepare and which part of the code should I change?

marquies commented 4 months ago

I also want to test with my own scene graphs. I modified testset_ddim_sampler.py so that it only loads the model and the supporting components, and then created my own data:

# Object category indices into the VG vocabulary, one entry per node.
objs = torch.LongTensor([1, 35, 118, 3, 134, 2, 4, 0]).cuda()
# imgs = torch.LongTensor([]).cuda()
# Triples are [subject index, predicate index, object index], where subject/object
# index into objs and the predicate indexes the relation vocabulary.
triples = torch.LongTensor([[3, 3, 6], [3, 3, 2], [1, 3, 4], [3, 1, 1]]).cuda()
# All 8 objects and all 4 triples belong to the same (single) image.
obj_to_img = torch.zeros(8, dtype=torch.long).cuda()
triple_to_img = torch.zeros(4, dtype=torch.long).cuda()

Then I used it for the sampler (image generation):

     graph_info = [imgs, objs, None, triples, obj_to_img, triple_to_img]
     cond = model.get_learned_conditioning(graph_info)

The result is worse, but I don't know whether that comes from my model (only trained to epoch 35) or from my input. I also wonder why I need to pass the image along with the data for the generation process.
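
For reference, a minimal sketch of the full sampling step, assuming the LDM-style DDIMSampler that testset_ddim_sampler.py builds on (the step count, batch size, and latent shape below are placeholders, not values confirmed by the repo; check the model config):

import torch
from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)
with torch.no_grad():
    graph_info = [imgs, objs, None, triples, obj_to_img, triple_to_img]
    cond = model.get_learned_conditioning(graph_info)
    samples, _ = sampler.sample(
        S=200,                # number of DDIM steps (placeholder)
        conditioning=cond,
        batch_size=1,         # one image per custom scene graph here
        shape=[3, 64, 64],    # latent shape [C, H, W]; depends on the model config
        verbose=False,
    )
    x = model.decode_first_stage(samples)       # decode latents to image space
    x = torch.clamp((x + 1.0) / 2.0, 0.0, 1.0)  # map from [-1, 1] to [0, 1]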

Maelic commented 3 months ago


You need to train the model for much longer if you want to obtain good results; it took me roughly 8 days and 335 epochs to reproduce the authors' results, see https://github.com/YangLing0818/SGDiff/issues/7#issuecomment-1827581994.

You will also need to design your custom scene graphs carefully: the original VG dataset is highly unbalanced, so the diffusion model does not learn efficient representations for all relation types. In my experiments, it works relatively well for reconstructing images from graphs composed of spatial relations, but it fails with more complex relations (e.g., semantic ones such as "person eating sandwich" or "person drinking wine").
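
If it helps, here is a hypothetical sketch of turning readable object/relation names into the index tensors used above. It assumes an sg2im-style vocab.json with "object_name_to_idx" and "pred_name_to_idx" keys; the actual file name and keys in this repo may differ.

import json
import torch

with open("vocab.json") as f:
    vocab = json.load(f)
obj_name_to_idx = vocab["object_name_to_idx"]
pred_name_to_idx = vocab["pred_name_to_idx"]

# Stick to spatial predicates ("above", "on", ...); the model handles them much
# better than semantic ones on the unbalanced VG data.
objects = ["sky", "tree", "grass"]
relations = [("sky", "above", "grass"), ("tree", "on", "grass")]

objs = torch.LongTensor([obj_name_to_idx[name] for name in objects]).cuda()
triples = torch.LongTensor(
    [[objects.index(s), pred_name_to_idx[p], objects.index(o)] for s, p, o in relations]
).cuda()
# Single image in the batch, so every object and triple maps to image 0.
obj_to_img = torch.zeros(len(objects), dtype=torch.long).cuda()
triple_to_img = torch.zeros(len(relations), dtype=torch.long).cuda()
# Note: the repo's dataloader may additionally append a dummy __image__ node and
# __in_image__ triples; mirror whatever testset_ddim_sampler.py does here.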