LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Information

Authors: Ning Yu+
Organization: Salesforce Research+
Paper:
Code: PageNotFound
Conference/Journal:

Summary

Summary figures and tables

What kind of paper is this?

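This paper proposes a method for placing multiple foreground text and image elements so that they do not interfere with the elements of a background image. The architecture is close to DETR, but it uses embeddings of the foreground texts and images in place of the object query embeddings (that is, the queries here are the foreground text and image elements). The embeddings combine a Layout VAE with the encoded text string, text length, and attribute (header, body, etc.). On top of carefully designed GAN and VAE losses aimed at high-quality layout generation, the key point is adding query-level bbox estimation as an objective. The method scores better than the baselines on both Layout FID and Image FID.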
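To make the "foreground elements as queries" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the single-head attention without learned projections, the untrained linear box head, the embedding size, and all names (`cross_attend`, `predict_boxes`, `head_weights`) are assumptions chosen for illustration. It only shows that each foreground embedding attends over the background features and yields exactly one box.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_attend(queries, memory):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scale = 1.0 / math.sqrt(len(memory[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, m) * scale for m in memory])
        attended.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(len(q))]
        )
    return attended


def predict_boxes(foreground_embeddings, background_tokens, head_weights):
    """Return one (cx, cy, w, h) box in [0, 1] per foreground element."""
    attended = cross_attend(foreground_embeddings, background_tokens)
    boxes = []
    for q, a in zip(foreground_embeddings, attended):
        hidden = [qi + ai for qi, ai in zip(q, a)]  # residual connection
        boxes.append([sigmoid(dot(w, hidden)) for w in head_weights])
    return boxes


random.seed(0)
dim = 8  # hypothetical embedding size
# Stand-in for flattened background image features (e.g. a 4x4 feature map).
background_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
# Stand-in for foreground element embeddings (e.g. header, body, button).
foreground_embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
# Untrained box head: four weight vectors, one per box coordinate.
head_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

boxes = predict_boxes(foreground_embeddings, background_tokens, head_weights)
```

Because the queries are the elements themselves rather than learned slots, the number of predicted boxes always equals the number of foreground elements, so this sketch needs no matching step between predictions and elements.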

Novelty

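An architecture and training scheme that fuse the layout generation task with the object detection task
Proposal of the Ad Banner Dataset
- Foreground texts and images are extracted from ad images, and inpainting is applied to create the background images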

Results

Other notes (why was it accepted? etc.)

The background image is given, and the goal is a placement that does not interfere with it
Foreground text has the attributes header, body, disclaimer, and button
GAN branch
- Takes both conditional and unconditional adversarial losses
- Adds decoders on the discriminator features and also takes auxiliary losses
- bbox reconstruction, to prevent position insensitivity
- Reconstruction of the background image
- Image reconstruction of the foreground image patches
- Reconstruction of the text
VAE part
- Reconstruction of the layout
- KL divergence (the usual)
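The two VAE terms listed above can be sketched as follows. The KL term is the standard closed form for a diagonal Gaussian posterior against a standard normal prior. The L1 reconstruction loss, the `kl_weight` parameter, and the function names are assumptions for illustration; the notes do not say which reconstruction loss or weighting the paper uses.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def layout_reconstruction_l1(pred_boxes, true_boxes):
    """Mean absolute error over all box coordinates (an assumed choice of loss)."""
    diffs = [
        abs(p - t)
        for pred, true in zip(pred_boxes, true_boxes)
        for p, t in zip(pred, true)
    ]
    return sum(diffs) / len(diffs)


def vae_loss(pred_boxes, true_boxes, mu, logvar, kl_weight=1.0):
    """Reconstruction term plus weighted KL term; kl_weight is a placeholder."""
    recon = layout_reconstruction_l1(pred_boxes, true_boxes)
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)
```

The KL term is zero exactly when the posterior equals the prior (`mu = 0`, `logvar = 0`) and grows as the encoder moves away from it.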
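GAN and VAE are optimized jointly
Including the bbox loss is important
Ablation study
- Interesting that in Table 1, removing the text class embeddings drops accuracy substantially
- Attributes are important for layout
Constructed the Ads banner dataset
Evaluated with LayoutFID and ImageFID
Looking at Table 2, it seems like the GAN alone might be enough
The loss design is quite heavy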
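Since the notes single out the bbox loss as important, here is a hedged sketch of what a query-level box regression loss could look like. It assumes the DETR-style combination of an L1 term and a generalized IoU (GIoU) term on `(cx, cy, w, h)` boxes; the notes do not state the exact form used in the paper. The loss weights and the `total_objective` helper are placeholders for illustration only.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(box_a, box_b):
    """Generalized IoU of two boxes with positive width and height."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def bbox_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Weighted L1 + (1 - GIoU); the weights are placeholders."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))


def total_objective(gan_term, vae_term, bbox_term, weights=(1.0, 1.0, 1.0)):
    """Hypothetical joint objective: a weighted sum of the three loss groups."""
    w_gan, w_vae, w_bbox = weights
    return w_gan * gan_term + w_vae * vae_term + w_bbox * bbox_term
```

Unlike plain IoU, GIoU stays informative when the predicted and target boxes do not overlap at all: it becomes negative and still varies with how far apart the boxes are.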
