Decide which datasets we will use

Humorloos commented 2 years ago

You should integrate:

- at least 3 different datasets
- at least 2,500 entities described in total (in the joint dataset); more is better, ideally >10,000 but <100,000
- at least 1,000 entities should be contained in at least two datasets (please estimate based on a small sample)
- at least 8 attributes in the joint dataset; entities should be identifiable by combinations of at least two attributes, e.g. name + birthdate
- at least 5 attributes should be contained in at least two datasets; some attributes (other than name) should be contained in three datasets (for fusion by voting)
- ideally, at least one of your attributes is a list attribute, e.g. actors of a movie, directors of a company, songs on a CD
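
The overlap requirement asks for an estimate based on a small sample. A minimal sketch of how that could be done: draw a random sample from one dataset, check how many sampled entities also appear in the other, and scale that share up. The function names, the `name` key, and the toy data are assumptions for illustration only; real matching would probably need a stricter key than a normalized name.

```python
import random


def estimate_overlap(dataset_a, dataset_b, key, sample_size, seed=0):
    """Estimate how many entities of dataset_a also occur in dataset_b.

    Draws a random sample from dataset_a, counts how many sampled keys
    appear in dataset_b, and scales that share up to the size of dataset_a.
    """
    keys_b = {key(record) for record in dataset_b}
    rng = random.Random(seed)
    sample = rng.sample(dataset_a, min(sample_size, len(dataset_a)))
    if not sample:
        return 0
    hits = sum(1 for record in sample if key(record) in keys_b)
    return round(hits / len(sample) * len(dataset_a))


def normalized_name(record):
    """Hypothetical matching key: the name, trimmed and lower-cased."""
    return record["name"].strip().lower()


# Toy data (not taken from the Kaggle datasets): 500 of 1,000 entities overlap.
dataset_a = [{"name": f"user{i}"} for i in range(1000)]
dataset_b = [{"name": f"USER{i} "} for i in range(500, 1500)]

estimate = estimate_overlap(dataset_a, dataset_b, normalized_name, sample_size=100)
print(f"estimated overlap: {estimate} entities")
```

With a sample of 100 the estimate is only rough, so it is mainly useful for checking whether we are clearly above or below the 1,000-entity threshold.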

Humorloos commented 2 years ago

Suggestion that probably satisfies the requirements: Twitter datasets

Twitter friends (https://www.kaggle.com/hwassner/TwitterFriends)
- avatar: URL to the profile picture
- followerCount: the number of followers of this user
- friendsCount: the number of accounts this user follows
- friendName: stores the @name (without the '@') of the user (beware this name can be changed by the user)
- id: user ID; this number cannot change (you can retrieve the screen name with this service: https://tweeterid.com/)
- friends: the list of IDs of the users this user follows
- lang: the language declared by the user (in this dataset there is only "en" (English))
- lastSeen: the timestamp of the user's most recent tweet
- tags: the hashtags (with or without #) used by the user, i.e. the "trending topics" the user tweeted about
- tweetID: Id of the last tweet posted by this user.
Twitter User Gender (https://www.kaggle.com/crowdflower/twitter-user-gender-classification)
- unitid: a unique id for user
- _golden: whether the user was included in the gold standard for the model; TRUE or FALSE
- unitstate: state of the observation; one of finalized (for contributor-judged) or golden (for gold standard observations)
- trustedjudgments: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations
- lastjudgment_at: date and time of last contributor judgment; blank for gold standard observations
- gender: one of male, female, or brand (for non-human profiles)
- gender:confidence: a float representing confidence in the provided gender
- profile_yn: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it
- profile_yn:confidence: confidence in the existence/non-existence of the profile
- created: date and time when the profile was created
- description: the user's profile description
- fav_number: number of tweets the user has favorited
- gender_gold: if the profile is golden, what is the gender?
- link_color: the link color on the profile, as a hex value
- name: the user's name
- profileyngold: whether the profile y/n value is golden
- profileimage: a link to the profile image
- retweet_count: number of times the user has retweeted (or possibly, been retweeted)
- sidebar_color: color of the profile sidebar, as a hex value
- text: text of a random one of the user's tweets
- tweet_coord: if the user has location turned on, the coordinates as a string with the format "[latitude, longitude]"
- tweet_count: number of tweets that the user has posted
- tweet_created: when the random tweet (in the text column) was created
- tweet_id: the tweet id of the random tweet
- tweet_location: location of the tweet; seems to not be particularly normalized
- user_timezone: the timezone of the user
Twitter covid19-tweets (https://www.kaggle.com/gpreda/covid19-tweets)
- 13 cols in total
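
To check the attribute requirements (at least 8 attributes in the joint dataset, at least 5 in two or more datasets), something like the following sketch could help. The target schema names and the column correspondences are my guesses from the column descriptions above, not verified against the data; covid19-tweets is left out because its columns are not listed here.

```python
from collections import Counter

# Hypothetical mapping from source columns to a shared target schema.
# Both the correspondences and the target names are assumptions.
mappings = {
    "TwitterFriends": {
        "avatar": "profile_image_url",
        "friendName": "screen_name",
        "followerCount": "follower_count",
        "friendsCount": "friends_count",
        "friends": "friends",
        "lang": "language",
        "lastSeen": "last_tweet_time",
        "tags": "hashtags",
    },
    "twitter-user-gender-classification": {
        "profileimage": "profile_image_url",
        "name": "screen_name",
        "gender": "gender",
        "created": "created_at",
        "description": "description",
        "tweet_count": "tweet_count",
        "user_timezone": "timezone",
    },
}


def attribute_coverage(mappings):
    """Count, per target attribute, how many datasets provide it."""
    counts = Counter()
    for columns in mappings.values():
        counts.update(set(columns.values()))
    return counts


coverage = attribute_coverage(mappings)
shared = sorted(attr for attr, n in coverage.items() if n >= 2)
print(f"{len(coverage)} attributes in the joint schema, shared by >= 2 datasets: {shared}")
```

Once the columns of the third dataset are mapped the same way, the same count shows which attributes are available in three datasets for fusion by voting.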

ashishrana160796 commented 2 years ago

This approach looks good to me @Humorloos, and I think we should also explore other datasets that are easily available or can be constructed easily, e.g. w/ APIs. Basically, I'd like to avoid the uncertainty of crawling tables on websites.

I think we might face an issue w/ the 2nd condition if we construct the joint dataset at the user_id level. My guess is that in the end we have to combine these datasets, either with a join or an append. I am not sure about that.
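
To make the join vs. append question concrete, here is a small sketch with toy records. It assumes a shared `user_id` key across datasets, which we have not verified for the Kaggle data, and the helper names are made up for illustration.

```python
def append_datasets(*datasets):
    """Append: stack all records; nothing is matched across datasets."""
    return [record for dataset in datasets for record in dataset]


def join_datasets(left, right, key):
    """Inner join: keep only entities present in both datasets and merge their attributes."""
    right_by_key = {record[key]: record for record in right}
    return [
        {**right_by_key[record[key]], **record}
        for record in left
        if record[key] in right_by_key
    ]


# Toy records, not taken from the actual datasets.
friends = [
    {"user_id": 1, "followerCount": 10},
    {"user_id": 2, "followerCount": 5},
    {"user_id": 3, "followerCount": 7},
]
gender = [
    {"user_id": 2, "gender": "female"},
    {"user_id": 3, "gender": "brand"},
    {"user_id": 4, "gender": "male"},
]

appended = append_datasets(friends, gender)
joined = join_datasets(friends, gender, "user_id")
distinct_users = {record["user_id"] for record in appended}

print(len(appended), "appended records,", len(distinct_users), "distinct users,", len(joined), "joined users")
```

An inner join only keeps the overlapping users, so the entity count shrinks to the size of the overlap, while an append keeps every record but leaves duplicates unmatched.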

But again, the direction of using Twitter datasets looks good. Thanks!

Humorloos / IE683

Decide which datasets we will use #2