Humorloos / IE683


Decide which datasets we will use #2

Closed Humorloos closed 2 years ago

Humorloos commented 2 years ago

You should integrate:

  1. at least 3 different datasets
  2. at least 2,500 entities described in total (in the joint dataset); more is better, ideally >10,000 but <100,000
  3. at least 1,000 entities contained in at least two datasets (please estimate based on a small sample)
  4. at least 8 attributes in the joint dataset; entities should be identifiable by a combination of at least two attributes, e.g. name + birthdate
  5. at least 5 attributes contained in at least two datasets; some attributes (other than name) should be contained in three datasets (for fusion by voting)
  6. ideally, at least one of your attributes is a list attribute, e.g. actors of a movie, directors of a company, songs on a CD

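A rough way to check candidate datasets against the numeric points of this checklist is sketched below. It is only an illustration: `summarize_candidates`, the `entity_key` field, and the `name` attribute are hypothetical names, and the sketch assumes the sampled records have already been matched to a shared entity key, which in practice needs identity resolution first.

```python
from collections import Counter


def summarize_candidates(datasets, key="entity_key", name_attr="name"):
    """Count what the checklist asks about.

    datasets: dict mapping a dataset name to a list of record dicts.
    key: hypothetical field holding an already matched entity identifier.
    """
    entity_counts = Counter()  # entity -> number of datasets containing it
    attr_counts = Counter()    # attribute -> number of datasets containing it
    for records in datasets.values():
        # Use sets so each entity/attribute counts once per dataset.
        entity_counts.update({r[key] for r in records})
        attr_counts.update({a for r in records for a in r if a != key})
    return {
        "datasets": len(datasets),
        "entities_total": len(entity_counts),
        "entities_in_2plus": sum(1 for c in entity_counts.values() if c >= 2),
        "attributes_total": len(attr_counts),
        "attributes_in_2plus": sum(1 for c in attr_counts.values() if c >= 2),
        "non_name_attributes_in_3plus": sum(
            1 for a, c in attr_counts.items() if c >= 3 and a != name_attr
        ),
    }
```

The returned counts can then be compared against the thresholds above (3 datasets, 2,500 entities, 1,000 shared entities, 8 attributes, 5 shared attributes). When run on a small sample, the shared-entity count has to be scaled up to estimate the value for the full data.
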
Humorloos commented 2 years ago

Example idea that probably satisfies the requirements: "Suggestion: Twitter datasets"

ashishrana160796 commented 2 years ago

This approach looks good to me @Humorloos, and I think we should also try exploring other datasets that are easily available or can be constructed easily, e.g. w/ APIs. Basically, my suggestion is to avoid the uncertainty of crawling tables from websites.

I think we might face an issue w/ respect to the 2nd condition if we construct the joint dataset at the user_id level. My guess is that in the end we have to combine these datasets, either by a join or by an append. I am not sure about that.
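To make the join vs. append question concrete, here is a minimal sketch in plain Python. The function names and the sample fields are made up for illustration, and letting the first value win on a conflict is only a placeholder for a real fusion rule.

```python
def append_records(*datasets):
    """Append: stack all records; the same user can appear in several rows."""
    return [dict(r) for ds in datasets for r in ds]


def join_records(*datasets, key="user_id"):
    """Join: merge records sharing `key` into one row per entity (full outer join).

    On conflicting attribute values the first dataset wins; this is an
    assumption for the sketch, not a recommended fusion strategy.
    """
    merged = {}
    for ds in datasets:
        for r in ds:
            row = merged.setdefault(r[key], {})
            for attr, value in r.items():
                row.setdefault(attr, value)  # keep the earliest value seen
    return list(merged.values())
```

With two datasets of two users each that share one user_id, the append gives 4 rows while the join gives 3 entities. So the entity count for the 2nd condition depends on which combination we pick.
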

But again, the direction of using Twitter datasets looks good. Thanks!