Save and Search Social Media Posts

blackforestboi commented 6 years ago

Oftentimes we as users want to save a specific post or comment on social media and find it again later. We need a “save to Memex” button for the major social networks (FB, TW, Pinterest, Reddit). As preparation for GSoC, we’d start with a simple(r) implementation for Twitter posts. Important: The implementation needs to be as generic as possible, so that we can easily expand to other social networks, and later save not only the high-level post but also single comments.

Note that these are merely suggestions; for your proposal, you might find that other things are important to do.

Preparation

[ ] Understanding the relevant code base and asking @poltak for ideas on how to integrate lists into the index & query. (Please keep discussions in this issue as much as possible.)
[ ] Investigating a way to store content in the DB, index it and make it searchable (potentially with a new content type (post/twitter); @poltak your input might be needed here). In the easiest case we would for now store as much as possible (author, #likes, #retweets, extracted comments and whatever else we can get) in a pouch doc, without making all data searchable, except for the text, URLs etc. The indexing structure will soon change anyhow with the work on https://github.com/bluesun/fastindex. After that switch has happened, we can make more search features possible.
[ ] Laying out a concrete plan on how to proceed and post it as a comment in this issue (Describe high level goals, processes and approach of implementation, data model)
[ ] Gather feedback from community
[ ] Implement :grinning:
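
One way to read the "store everything, index only text and URLs" idea above is sketched below in TypeScript. The `PostDoc` shape, its field names, the id scheme and the `searchableTerms` helper are assumptions made for illustration; none of them come from the existing Memex code, and the tokenizer is deliberately naive (ASCII only).

```typescript
// Hypothetical document shape: every field is stored,
// but only `text` and `url` feed the search index for now.
interface PostDoc {
  _id: string;          // assumed id scheme, e.g. "post/twitter/<tweet id>"
  contentType: string;  // e.g. "post/twitter"
  url: string;
  text: string;
  author?: string;
  likes?: number;
  retweets?: number;
  comments?: string[];
}

// Derive the searchable terms of a document.
// Naive tokenizer: lowercases, splits on anything that is not a-z or 0-9,
// and de-duplicates while keeping first-seen order.
function searchableTerms(doc: PostDoc): string[] {
  const raw = (doc.text + " " + doc.url).toLowerCase();
  const tokens = raw.split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

const exampleDoc: PostDoc = {
  _id: "post/twitter/123",
  contentType: "post/twitter",
  url: "https://twitter.com/alice/status/123",
  text: "Hello Memex hello",
  author: "Alice Example",
  likes: 42,
};

// Terms come from the text and the URL only; `author` and `likes` are
// stored on the document but do not contribute any terms.
const exampleTerms = searchableTerms(exampleDoc);
```

Once the new index lands, more fields could be made searchable by changing only `searchableTerms`, without touching the stored documents.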

Goal V1:

Meaningful conceptual and/or implementation progress before March 12 is expected

[ ] Add Save to Memex Button to each tweet
[ ] Grab content of a single tweet
[ ] Index Content of Tweet
[ ] Store all Metadata you can get in the related pouch document, even if not indexed yet (e.g. Author/user, time posted)
[ ] Enabling searchability
[ ] Make the addition process as generic as possible, so we can add other kinds of posts (e.g. Facebook, or comments)
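
A minimal sketch of what a "generic addition process" could look like, assuming each network contributes one small adapter that maps its raw data onto a shared record. All type and function names here (`SavedPost`, `PostAdapter`, `RawTweet`, `addPost`) are hypothetical, and the in-memory object merely stands in for whatever storage Memex ends up using.

```typescript
type PostSource = "twitter" | "facebook" | "reddit";
type PostKind = "post" | "comment";

// Network-independent record; all names here are illustrative assumptions.
interface SavedPost {
  id: string;            // unique within one source
  source: PostSource;
  kind: PostKind;
  url: string;
  text: string;
  author?: string;
  parentId?: string;     // set for comments/replies
  extra: Record<string, unknown>; // source-specific metadata, stored but not indexed
}

// Each network supplies one adapter that maps its raw data to SavedPost.
interface PostAdapter<Raw> {
  source: PostSource;
  toSavedPost(raw: Raw): SavedPost;
}

// Hypothetical raw tweet, e.g. as collected from the page.
interface RawTweet {
  tweetId: string;
  user: string;
  body: string;
  likeCount?: number;
}

const twitterAdapter: PostAdapter<RawTweet> = {
  source: "twitter",
  toSavedPost(raw: RawTweet): SavedPost {
    return {
      id: raw.tweetId,
      source: "twitter",
      kind: "post",
      url: "https://twitter.com/" + raw.user + "/status/" + raw.tweetId,
      text: raw.body,
      author: raw.user,
      extra: { likeCount: raw.likeCount },
    };
  },
};

// Generic entry point: the store never needs to know which network a post came from.
function addPost<Raw>(
  store: Record<string, SavedPost>,
  adapter: PostAdapter<Raw>,
  raw: Raw
): SavedPost {
  const post = adapter.toSavedPost(raw);
  store[post.source + ":" + post.id] = post;
  return post;
}

const postStore: Record<string, SavedPost> = {};
const rawTweet: RawTweet = { tweetId: "123", user: "alice", body: "Saving this to Memex", likeCount: 7 };
const savedTweet = addPost(postStore, twitterAdapter, rawTweet);
```

Adding Facebook posts or single comments would then mean writing another adapter (and setting `kind` and `parentId` for comments) rather than changing `addPost`.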

Goal V2:

[ ] Filter by content type (post, page, tweet)
[ ] Add Facebook posts
[ ] Add Button to single Twitter replies / Facebook comments and allow them to be stored separately.
[ ] Automatically store content that has been liked or shared
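
The content-type filter from this list could be as small as the sketch below. The `SearchHit` shape and the `filterByContentType` name are assumptions for illustration only; the three type labels are the ones named above.

```typescript
type ContentType = "page" | "post" | "tweet";

// Hypothetical search result shape.
interface SearchHit {
  url: string;
  contentType: ContentType;
}

// Keep only hits whose content type is in `allowed`.
// An empty `allowed` list is treated as "no filter".
function filterByContentType(hits: SearchHit[], allowed: ContentType[]): SearchHit[] {
  if (allowed.length === 0) {
    return hits.slice();
  }
  return hits.filter((hit) => allowed.indexOf(hit.contentType) !== -1);
}

const sampleHits: SearchHit[] = [
  { url: "https://example.com/article", contentType: "page" },
  { url: "https://twitter.com/alice/status/123", contentType: "tweet" },
  { url: "https://www.facebook.com/some-post", contentType: "post" },
];

const tweetsOnly = filterByContentType(sampleHits, ["tweet"]);
```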

Goal V3:

[ ] Enabling automatic storing of all tweets/posts a user sees in their feed.

Prerequisites

In order to work on this task, you should bring experience with and interest in

Data structures / Data Models
Search algorithms
Databases

ansh103 commented 6 years ago

@oliversauter and @poltak I am interested in working on this.
So moving on to my analysis so far: According to your preparation list, first comes 'Understanding the relevant code base'.
a) Mechanism: What does the button do? It saves a tweet. So where should it be saved? Memex analyzes pages on the basis of history, and we already have an interface to store that. Now the question is whether we want it to be with our web pages storing directory or a new one for just all kinds of Social Media (Twitter, FB etc).
b) UI for button: Setting up a clean looking small button with Listener to save Tweets. We can have a drop-down list in the button to do various tasks, like "Save Tweet", "Retweet", "Schedule a Retweet" and so on.
This is a broad outline of what I understood. Looking for more detailed analysis in the comments/discussion.

raincrash commented 6 years ago

For the Twitter integration part, @oliversauter and @poltak, your thoughts on adding an OAuth integration with Twitter, using the streaming API when the user is on the Twitter homepage, and then adding listing, indexing and searching of the stream? We get decent metadata this way, and it would also reduce the amount of crawling needed through the ever-changing Twitter UI. (This is similar to Goal V3 mentioned above.)

I will look into single tweet grab meanwhile.

blackforestboi commented 6 years ago

@raincrash > your thoughts on adding an OAuth integration with Twitter, using the streaming API when the user is on the Twitter homepage

Are you referring here to 'Goal V3'? This may indeed be a valuable approach to the problem, especially for the newsfeed view. Wonder what kind of data we could get out of there and how useful it may be in terms of us also indexing the tweets/posts a user has actually seen.

blackforestboi commented 6 years ago

Now the question is whether we want it to be with our web pages storing directory or a new one for just all kinds of Social Media (Twitter,FB etc)

Good question to @poltak

UI for button:- Setting up a clean looking small button with Listener to save Tweets.

I think @raincrash already has some ideas here, right?

Like "Save Tweet", "Retweet" "Schedule a Retweet" and so on.

I think retweeting and scheduling a retweet are a bit out of scope here. We really just want to enable people to save a tweet to Memex.

@ansh103 have you found the relevant pieces in the code enabling you to theoretically save a tweet? What kind of adjustments to the code/data structure do you think are necessary?

bohrium272 commented 6 years ago

@oliversauter I've worked with the Twitter API, so I'd like to add that OAuth is a viable way of doing this. One tweet object from the API can provide the exact same information as what you see on your screen, just in a more programmer-friendly way. On the other hand, there are a number of Chrome extensions, like Grammarly, that are able to manipulate Twitter's DOM. We need to insert a button only so I think asking the user for access to all their tweets is not sustainable.

blackforestboi commented 6 years ago

We need to insert a button only so I think asking the user for access to all their tweets is not sustainable.

I agree for the first few features it is not necessary. For "Goal V3" it may get interesting

raincrash commented 6 years ago

Yes, as @arpitgogia mentioned, Streaming API + OAuth would be a very valuable resource for 'Goal V3'. I have worked on both of them for a different project. The information also includes a lot of metadata like geotagging, trends and metrics. Not sure about the "how useful it may be in terms of us also indexing the tweets/posts a user has actually seen." part. We could try to list/index everything first and then replace the unnecessary parts?

I think @raincrash already has some ideas here, right?

I am currently looking into the DOM manipulation first, before adding a UI button on the content.

We need to insert a button only so I think asking the user for access to all their tweets is not sustainable.

I am wondering about this part too. Indexing only the tweets that the user physically asks us to is a more viable approach than indexing all the tweets the user has seen, correct?

blackforestboi commented 6 years ago

Indexing only the tweets that the user physically asks us to is a more viable approach than indexing all the tweets the user has seen, correct?

For now, definitely.

bohrium272 commented 6 years ago

I am wondering about this part too. Indexing only the tweets that the user physically asks us to is a more viable approach than indexing all the tweets the user has seen, correct?

We can have this as a setting. Auto or manual indexing.
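
As a rough sketch of such a setting (the names `IndexingMode`, `CaptureTrigger`, `SocialSettings` and `shouldIndex` are made up for illustration): an explicit click on the save button always indexes, while tweets merely seen in the feed are indexed only in auto mode.

```typescript
type IndexingMode = "manual" | "auto";
type CaptureTrigger = "save-button" | "seen-in-feed";

// Hypothetical user setting.
interface SocialSettings {
  indexingMode: IndexingMode;
}

// Decide whether a tweet should be indexed for a given trigger.
function shouldIndex(settings: SocialSettings, trigger: CaptureTrigger): boolean {
  if (trigger === "save-button") {
    return true; // an explicit user action is honoured in both modes
  }
  return settings.indexingMode === "auto";
}
```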

I agree for the first few features it is not necessary. For "Goal V3" it may get interesting

In a broader sense, querying the Twitter API is one additional task whereas if we are able to insert a button in every tweet, we can get the corresponding text as well.

raincrash commented 6 years ago

In a broader sense, querying the Twitter API is one additional task whereas if we are able to insert a button in every tweet, we can get the corresponding text as well.

Agreed. I am looking into the DOM manipulation part first -- adding a button and retrieving the context on click, where the context is the corresponding tweet id and data. Once we get that into a pouch document, it should be a bit easier to add indexing and search ability. What do you think about this approach?
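
For the "context" part, a minimal sketch, assuming the click handler can get hold of the tweet's permalink URL (how it reads that from the DOM is left open here). `TweetRef` and `parseTweetUrl` are hypothetical names; the pattern relies on the usual `twitter.com/<user>/status/<id>` permalink form.

```typescript
interface TweetRef {
  user: string;
  tweetId: string; // kept as a string: tweet ids can exceed Number.MAX_SAFE_INTEGER
}

// Matches permalinks such as https://twitter.com/<user>/status/<id>
const TWEET_URL = /^https?:\/\/(?:www\.|mobile\.)?twitter\.com\/([A-Za-z0-9_]{1,15})\/status\/(\d+)/;

// Returns the user name and tweet id, or null if the URL is not a tweet permalink.
function parseTweetUrl(url: string): TweetRef | null {
  const match = TWEET_URL.exec(url);
  if (match === null) {
    return null;
  }
  const user = match[1];
  const tweetId = match[2];
  if (!user || !tweetId) {
    return null;
  }
  return { user, tweetId };
}
```

The returned id could then serve as the key of the stored document, with the tweet text and metadata attached to it.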

Also, it is definitely easier to auto-index all tweets with OAuth, but that would be a tremendous volume of streamed data for a power user.

blackforestboi commented 6 years ago

Once we get that into a pouch document, it should be a bit easier to add indexing and search ability. What do you think about this approach?

You may want to look into how pages are stored right now, as all the handling of storing in pouch etc is already implemented.

ansh103 commented 6 years ago

@oliversauter I was thinking about a Site Listener function (which would again be divided based on the site type) that captures the link. This way we generalize the whole thing. But again, that is a goal set for later. So coming to storage part. Existing storage code base lies in storage.js. Right? @raincrash, if you are looking into this, then I can take up the UI of the button. @arpitgogia, can you establish the link between the two? Sounds good?
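
A possible shape for such a site listener, sketched under the assumption that the site type can be derived from the hostname alone. `SiteType`, `KNOWN_SITES`, `hostnameOf`, `siteTypeOf` and `captureLink` are hypothetical names, and the hostname parsing is simplified (it ignores credentials in URLs, for instance).

```typescript
type SiteType = "twitter" | "facebook" | "reddit" | "unknown";

const KNOWN_SITES: Array<{ site: SiteType; domain: string }> = [
  { site: "twitter", domain: "twitter.com" },
  { site: "facebook", domain: "facebook.com" },
  { site: "reddit", domain: "reddit.com" },
];

// Simplified hostname extraction for http(s) URLs; returns "" if none is found.
function hostnameOf(url: string): string {
  const match = /^https?:\/\/([^\/?#:]+)/i.exec(url);
  const host = match === null ? undefined : match[1];
  return host === undefined ? "" : host.toLowerCase();
}

// Map a URL to a site type, accepting the bare domain and any subdomain of it.
function siteTypeOf(url: string): SiteType {
  const host = hostnameOf(url);
  for (const entry of KNOWN_SITES) {
    const dotted = "." + entry.domain;
    const isSubdomain =
      host.length > dotted.length && host.slice(-dotted.length) === dotted;
    if (host === entry.domain || isSubdomain) {
      return entry.site;
    }
  }
  return "unknown";
}

interface CapturedLink {
  site: SiteType;
  link: string;
}

// Capture the link only for sites we know how to handle.
function captureLink(url: string): CapturedLink | null {
  const site = siteTypeOf(url);
  return site === "unknown" ? null : { site, link: url };
}
```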

raincrash commented 6 years ago

So coming to storage part. Existing storage code base lies in storage.js. Right?

Check out src/pouchdb.js and wherever PouchDB is manipulated, esp. in utils.

bohrium272 commented 6 years ago

Check out src/pouchdb.js and wherever PouchDB is manipulated, esp. in utils

PouchDB isn't where the index is stored. In fact the Pouch part will be deprecated in the future in favor of IndexedDB. Check out src/page-storage and maybe src/search and of course @poltak knows the most about the whole search mechanism.

poltak commented 6 years ago

Now the question is whether we want it to be with our web pages storing directory or a new one for just all kinds of Social Media (Twitter,FB etc)

@ansh103 Our data side of things will most likely change in the next few months. I think originally the idea was to create a separate data type, but I'm not sure there's enough difference: a tweet should have text content, a title (maybe we don't want these for tweets; not sure what the API provides), and URL data + any associated visit/bookmark event data; basically all we need to make a page currently.

Tweet specific data needs to be stored somewhere though, probably in associated structure to pages, similar to visits/bookmarks currently.
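
To make that concrete, here is a rough sketch of a tweet stored as an ordinary page plus an associated record keyed by the page URL. Every name here (`PageRecord`, `TweetMeta`, `TweetDetails`, `SketchStore`, `saveTweet`, `findByUrl`) is an assumption for illustration and says nothing about how the new index will actually look.

```typescript
// What a page needs today, per the description above: text, optional title, URL.
interface PageRecord {
  url: string;
  text: string;
  title?: string;
}

// Tweet-specific data, kept outside the page itself.
interface TweetMeta {
  tweetId: string;
  author: string;
  likes?: number;
  retweets?: number;
}

// Associated record, linked to its page by URL (similar in spirit to visits/bookmarks).
interface TweetDetails extends TweetMeta {
  pageUrl: string;
}

// In-memory stand-in for the real storage.
interface SketchStore {
  pages: Record<string, PageRecord>;
  tweetDetails: Record<string, TweetDetails>;
}

function saveTweet(store: SketchStore, page: PageRecord, meta: TweetMeta): void {
  store.pages[page.url] = page;
  store.tweetDetails[page.url] = { ...meta, pageUrl: page.url };
}

// Look up a page and, if it happens to be a tweet, its associated details.
function findByUrl(
  store: SketchStore,
  url: string
): { page: PageRecord; tweet: TweetDetails | null } | null {
  const page = store.pages[url];
  if (page === undefined) {
    return null;
  }
  const tweet = store.tweetDetails[url];
  return { page, tweet: tweet === undefined ? null : tweet };
}
```

With this split, search only ever deals with `PageRecord`, and the tweet-specific fields can change shape later without affecting it.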

What @arpitgogia says is right; stay away from Pouch, unless you just want to get it stored somewhere for now so you can work with it. Right now Pouch is just used for storing some images, and adds a huge amount of complexity trying to juggle 2 DBs for no reason - new implementation should hopefully address this and remove pouch.

tl;dr: if you want to get started straight away on this task, go for it but just keep in mind that the data structure may be completely changed later. What we want to do with the data-side of this feature will become more apparent as we progress on the new index implementation, so will keep it in mind
