Altimis / Scweet

A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...

Experimental: Recursive Following (AKA Friend) Tree Function #56

Open · BradKML opened this issue 3 years ago

BradKML commented 3 years ago

As demonstrated in the repos below, recursive graph mining of Twitter's following (AKA friends) graph is doable with the official API.

However, doing this without the API may be tricky, as there needs to be a breadth-first search function that prevents repeated scraping. As demonstrated in the links at the bottom, three degrees of following/friends is enough to analyze a community.
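
A minimal sketch of such a depth-limited, deduplicated breadth-first crawl, assuming a hypothetical `fetch_following(user)` wrapper around the scraper (not an actual Scweet function) that returns the accounts a user follows; the `visited` set is what prevents repeated scraping, and `max_depth=3` caps the crawl at three degrees:

```python
from collections import deque

def crawl_following(seed, fetch_following, max_depth=3):
    """Breadth-first crawl of the following graph, capped at max_depth hops.

    fetch_following(user) is a placeholder for the scraper call and is
    assumed to return an iterable of screen names the user follows.
    """
    visited = {seed}
    frontier = deque([(seed, 0)])
    edges = []  # (follower, followed) pairs discovered so far

    while frontier:
        user, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for followed in fetch_following(user):
            edges.append((user, followed))
            if followed not in visited:  # never queue (or re-scrape) a user twice
                visited.add(followed)
                frontier.append((followed, depth + 1))
    return edges
```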

BradKML commented 3 years ago

This class can be borrowed easily; besides a simple `import queue` (plus `import sys` for the progress output), it is self-contained:

import queue
import sys


class Frontier:
    """ Objects of this class keep track of the Breadth-First Search
    Frontier as it is expanded. 
    """
    def __init__(self, src, expander):
        self.internal = set()
        self.perimeter = set()

        # The BFS queue contains the same elements as self.perimeter
        # but in bfs order, i.e. elements closer to source are nearer
        # to the beginning of the queue.
        self.queue = queue.Queue()

        self.perimeter.add(src)
        self.queue.put_nowait(src)

        # some metadata to be able to trace back the path from a node
        # to the source node
        self.metadata = {}
        self.metadata[src] = {"distance": 0, "parent": None}

        # This function is called on a node when we need to get the
        # outgoing edges. It is going to be different for the source
        # and destination nodes in our social graph. Returns a set.
        self.expander = expander

    def expand_perimeter(self):
        """ Expands the pperimeter by picking up the node at the front
        of the queue and expanding it to its followers/friends depending
        upon the definition of this.expander function. 

        Returns the new elements found outside the periphery.
        """
        # User at the front of the breadth-first queue is made internal
        u = self.queue.get_nowait()
        self.internal.add(u)
        self.perimeter.remove(u)

        # let's move the frontier forward at the node 'u'
        new_nodes = self.expander(u).difference(self.internal, self.perimeter)
        print(".", end="")
        sys.stdout.flush()

        # Keep track of distance of a node from src and its parent
        d = self.metadata[u]["distance"]
        for n in new_nodes:
            self.metadata[n] = {"distance": d + 1, "parent": u}

        self.perimeter.update(new_nodes)
        list(map(self.queue.put_nowait, new_nodes))

        return set(new_nodes)

    def is_on_perimeter(self, user_id):
        """ Tells whether user_id is in on the perimeter of this bfs frontier.
        """
        return (user_id in self.perimeter)

    def covered_all(self):
        """ True if we have covered all of the graph, i.e. the whole
        graph is now internal and there remain no nodes on the
        periphery.
        """
        return not self.perimeter

    def get_parent(self, n):
        """ Returns the parent for node n.
        Will throw KeyError when we haven't seen the node before. 
        But in this application, we don't expect that to happen. So, if it
        happens, something is really messed up.
        """
        return self.metadata[n]["parent"]

    def get_distance(self, n):
        """" Returns the distance of node `n` from the source.
        """
        return self.metadata[n]["distance"]

The Frontier object only takes a seed node and an expander function (here a lambda around the scraper) as its main components. I can imagine `frontier = Frontier(seed_user, lambda x: get_users_following(users=x, env=env_path, verbose=0, headless=True, wait=2, limit=50, file_path=None))` being the main command that connects the two; however, this will cause duplicate calls.
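
A hedged sketch of that driver loop, assuming the usual `from Scweet.user import get_users_following` import and that the function returns a mapping from each queried user to the accounts they follow (the exact return shape may need checking against the installed version), stopping once the breadth-first queue reaches three degrees:

```python
from Scweet.user import get_users_following

env_path = ".env"  # assumed path to the Scweet credentials file
MAX_DEPTH = 3

def expand(user):
    # Assumption: get_users_following returns something like {user: [followed, ...]}.
    result = get_users_following(users=[user], env=env_path, verbose=0,
                                 headless=True, wait=2, limit=50, file_path=None)
    return set(result.get(user, []))

frontier = Frontier("seed_user", expand)

while not frontier.covered_all():
    # queue.Queue keeps its items in a deque, so peek at the next user to expand;
    # nodes come out in BFS order, so once one is MAX_DEPTH hops away, all are.
    next_user = frontier.queue.queue[0]
    if frontier.get_distance(next_user) >= MAX_DEPTH:
        break
    frontier.expand_perimeter()
```

Because the expander's results are diffed against `internal` and `perimeter` inside `expand_perimeter`, each user should only be scraped once per run; duplicate calls would mainly come from restarting the script, which is where caching comes in.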

BradKML commented 3 years ago

To note the issues for myself: there needs to be a way to cache and restore instances of this object. Another problem is collective pulling when given a list of names: https://github.com/narendraj9/digsep/issues/2
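
For the caching point, a minimal sketch of persisting and restoring the crawl state with `pickle`; this is an assumption about how resumability could be bolted on, not something the class above already supports. The expander (often a lambda) cannot be pickled, so only the bookkeeping is saved and the expander is passed back in on load:

```python
import pickle
import queue

def save_frontier(frontier, path):
    state = {
        "internal": frontier.internal,
        "perimeter": frontier.perimeter,
        "metadata": frontier.metadata,
        "pending": list(frontier.queue.queue),  # preserves BFS order
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_frontier(path, expander):
    with open(path, "rb") as f:
        state = pickle.load(f)
    # Rebuild a Frontier around an arbitrary known node, then overwrite
    # its bookkeeping with the saved state.
    seed = next(iter(state["perimeter"] or state["internal"]))
    frontier = Frontier(seed, expander)
    frontier.internal = state["internal"]
    frontier.perimeter = state["perimeter"]
    frontier.metadata = state["metadata"]
    frontier.queue = queue.Queue()
    for user in state["pending"]:
        frontier.queue.put_nowait(user)
    return frontier
```

Collective pulling from a name list could perhaps be handled similarly: put every seed name on the queue with distance 0 before expanding, and let the shared internal/perimeter sets deduplicate the work.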