QUT-Digital-Observatory / tidy_tweet

Tidies Twitter json collected with Twarc into relational tables.
https://pypi.org/project/tidy-tweet/
MIT License
3 stars 0 forks source link

Proposal: keep user profile properties like URLs and hashtags distinct from tweet hashtags and URLs #12

Open SamHames opened 2 years ago

SamHames commented 2 years ago

Apart from the structural differences between tweets and user profiles (a single text blog + attachments for tweets, multiple fields with different meanings/purposes for user profiles), URLs, hashtags and mentions in profiles have different interpretations and meanings than the same thing in tweets. The current schema conflates these, making every query that needs to touch these elements more complicated to avoid mistakes.

Additionally, this schema means that we can't use foreign key constraints to maintain referential integrity.

As an example of this for hashtags, I propose we change the current schema:

create table hashtag (
    source_id text, -- the id of the object (user or tweet) this hashtag is included in
    source_type text, -- "user" or "tweet"
    field text not null, -- e.g. "description", "text" - which field of the source
                         -- object the hashtag is in
    tag text not null
)

To the following two tables:

create table if not exists tweet_hashtag (
    id integer foreign key references tweet(id),
    hashtag text,
    -- Normalisation, to allow indexing on common transformations of hashtags to match Twitter platform affordances
    -- Twitter makes no distinction between #SuperBowl, #SuperbOwl and #SUPERBOWL
    hashtag_lower text,
    primary key (id, hashtag)
);

create table if not exists user_profile_hashtag (
    id integer foreign key references tweet(id),
    field text,
    hashtag text,
    hashtag_lower text,
    primary key (id, field, hashtag)
)