Store full source metadata

evazion commented 5 years ago

We already know how to extract a lot of metadata from sources using source strategies, but most of it gets thrown away. Instead, it should be stored permanently in a database table.

Specifically, for every source we should extract and store the following:

site name (Pixiv, Twitter, etc)
work id (Pixiv illust id, Twitter status id, etc)
artist id
artist username
tags (original Pixiv tags, Twitter hashtags, etc)
upload date (date the post was uploaded to Twitter / Pixiv / etc)
image url
page url
page number (for images that are part of a Pixiv / Twitter batch)
status (whether the source is active or deleted)
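As a rough illustration, the fields listed above could be modeled as one record per source. This is only a sketch: the names `SourceMetadata` and `SourceStatus`, the field names, and the example values (including the `example.com` URLs) are all made up here and are not taken from Danbooru's code or schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional, Tuple


class SourceStatus(Enum):
    """Whether the source is still available on the original site."""
    ACTIVE = "active"
    DELETED = "deleted"


@dataclass(frozen=True)
class SourceMetadata:
    """Hypothetical record holding the fields proposed in this issue."""
    site_name: str                 # e.g. "Pixiv", "Twitter"
    work_id: str                   # Pixiv illust id, Twitter status id, ...
    artist_id: str
    artist_username: str
    tags: Tuple[str, ...]          # original site tags or hashtags
    uploaded_at: datetime          # when the work was posted on the site
    image_url: str
    page_url: str
    page_number: Optional[int] = None   # position within a multi-image work
    status: SourceStatus = SourceStatus.ACTIVE


# Made-up example values, for illustration only.
example = SourceMetadata(
    site_name="Pixiv",
    work_id="12345678",
    artist_id="4321",
    artist_username="some_artist",
    tags=("オリジナル", "女の子"),
    uploaded_at=datetime(2019, 1, 5, 12, 0, tzinfo=timezone.utc),
    image_url="https://example.com/img/12345678_p0.png",
    page_url="https://example.com/artworks/12345678",
    page_number=0,
)
```

IDs are kept as strings in this sketch because different sites use different ID formats; that is a design assumption, not a requirement.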

This would facilitate a number of things:

Searching posts by Twitter/Pixiv/etc tags.
Searching posts by Twitter/DeviantArt/etc IDs (#3924).
Sorting posts by the date they were originally posted on Twitter/Pixiv/etc (#3899).
Finding posts that are part of the same Pixiv or Twitter gallery.
Finding untagged posts by the same artist.
Simplifying artist lookups (find artists by artist id instead of trying to match profile urls).
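To make these use cases concrete, here is a small in-memory sketch of the lookups that stored metadata would allow. The row layout, the function names, and the sample data are hypothetical; a real implementation would run these as database queries rather than over Python lists.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: one per post, holding the stored source metadata.
rows = [
    {"post_id": 1, "site": "Pixiv", "work_id": "100", "artist_id": "7",
     "tags": ["オリジナル"], "uploaded_at": datetime(2019, 1, 5), "page": 0},
    {"post_id": 2, "site": "Pixiv", "work_id": "100", "artist_id": "7",
     "tags": ["オリジナル"], "uploaded_at": datetime(2019, 1, 5), "page": 1},
    {"post_id": 3, "site": "Twitter", "work_id": "900", "artist_id": "42",
     "tags": ["fanart"], "uploaded_at": datetime(2018, 12, 1), "page": 0},
]


def posts_with_source_tag(rows, site, tag):
    """Search posts by the tag they carried on the original site."""
    return [r["post_id"] for r in rows if r["site"] == site and tag in r["tags"]]


def posts_by_artist(rows, site, artist_id):
    """Find posts by site-specific artist id instead of matching profile URLs."""
    return [r["post_id"] for r in rows
            if r["site"] == site and r["artist_id"] == artist_id]


def galleries(rows):
    """Group posts that share a (site, work id), i.e. the same gallery."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["site"], r["work_id"])].append(r["post_id"])
    # Keep only works that actually have more than one post.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def sorted_by_source_date(rows):
    """Order posts by the date they were originally posted on the site."""
    return [r["post_id"] for r in sorted(rows, key=lambda r: r["uploaded_at"])]
```

For example, `galleries(rows)` groups posts 1 and 2 under the same Pixiv work, and `posts_by_artist(rows, "Pixiv", "7")` finds both of them without touching profile URLs.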

r888888888 commented 5 years ago

Maybe this is a good candidate for Postgres's JSON indexing support.
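One way this might look, purely as a sketch: keep the metadata in a `jsonb` column, add a GIN index, and query it with the `@>` containment operator. The table, column, and index names below are assumptions and not part of Danbooru's schema, and the `%s` placeholder assumes a psycopg-style driver. The Python only holds the SQL strings and builds the JSON parameter; it does not connect to a database.

```python
import json

# Assumed table layout: one jsonb document of source metadata per post.
CREATE_TABLE = """
CREATE TABLE source_metadata (
    post_id  bigint PRIMARY KEY,
    metadata jsonb NOT NULL
);
"""

# A GIN index using the jsonb_path_ops operator class supports the
# @> containment operator used in the query below.
CREATE_INDEX = """
CREATE INDEX index_source_metadata_on_metadata
    ON source_metadata USING gin (metadata jsonb_path_ops);
"""

# Find posts whose metadata contains the given JSON fragment.
FIND_BY_CONTAINMENT = (
    "SELECT post_id FROM source_metadata WHERE metadata @> %s::jsonb;"
)


def containment_param(**fields):
    """Serialize keyword arguments into the JSON fragment to match against."""
    return json.dumps(fields, ensure_ascii=False, sort_keys=True)


# Example parameter: Twitter sources that carry the hashtag "fanart".
param = containment_param(site="Twitter", tags=["fanart"])
```

The trade-off is flexibility versus structure: a single indexed `jsonb` column avoids a migration for every new field, while dedicated columns make sorting by fields such as the upload date more straightforward.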

GlassedSilver commented 5 years ago

I love this idea. Things like upload dates in particular can really come in handy.

danbooru / danbooru

Store full source metadata #4113