ademjemaa / fbcrawl

facebook crawler
4 stars, 4 forks

scrape fb page likes and followers #6

Open Rahulsunny11 opened 4 years ago

Rahulsunny11 commented 4 years ago

Hi @ademjemaa, thanks for this amazing tool. I have been trying to scrape page details such as the followers and likes of a page, but I haven't been successful. Will you please help me with that? Thanks :)

ademjemaa commented 4 years ago

Hello @Rahulsunny11

Thanks for taking an interest in my project. I will try to look into likes and followers, but I should let you know that I haven't touched this project in a very long time, so it might take me a while to get accustomed to it again.

However, I will look into this specific issue and see if I can manage to add a spider that does exactly this, i.e. counts the number of followers and likes and makes a list of them.

However, if the list of followers and likes isn't available on mbasic.facebook, I might have to make another spider that uses m.facebook or even www.facebook, and that would take a long time.

If the help you need with this is for learning web scraping, I will be happy to explain how it works through detailed comments in the spiders and point you at documentation to read, so you can get a better understanding of Scrapy and XPath.

I will update you as soon as I take a look into the situation :)

Rahulsunny11 commented 4 years ago

Thank you for the quick response. I am facing difficulties in understanding Scrapy; it would be great if you could help me with that. Also, in rugantio's fbcrawl repository I found a profile.py script. How can I run that script?

ademjemaa commented 4 years ago

If you want to execute a spider, you should use a command line in this format:

`scrapy crawl {spider name} -a email="" -a password="" -a page="" -a year="2015" -a lang="en" -o {export document name}.{format extension}`

For example, for my profile spider I type:

`scrapy crawl profile -a email="" -a password="" -a page="" -o profile.csv`

I don't know how he crawls profiles in his spider, but just check the kwargs needed by the spider you want to use and execute it with that command. If the spider won't run properly, change the log level to DEBUG and check where the crawler stops or encounters an error. If you encounter a very delicate error you should contact him directly; I do not know his spiders, I only used an old one he pushed almost two years ago, back when he only had one spider.

ademjemaa commented 4 years ago

As for likes and followers on a page: I have checked both mbasic and www.facebook, and it seems that Facebook keeps the list of profiles that like/follow a page hidden; you can only scrape the total number of likes and followers. That shouldn't be too hard to implement in the spider you use to crawl pages normally: just add a new item to the items list and look for the unique XPath that leads only to the total number of likes and followers.
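The total usually appears as display text (something like "1,234 people like this"), so after the XPath grabs that string you would still need to pull the number out of it. A minimal sketch in plain Python — the helper name, the regex, and the "K"/"M" suffix handling are my assumptions about the page text, not code from fbcrawl:

```python
import re

def parse_count(text):
    """Pull a numeric total out of strings like '1,234 people like this'
    or '2.5K followers'. Returns None when no number is found."""
    m = re.search(r'(\d[\d.,]*)\s*([KM]?)', text)
    if not m:
        return None
    value = float(m.group(1).replace(',', ''))
    # Assumed suffix convention: 'K' = thousands, 'M' = millions.
    scale = {'K': 1_000, 'M': 1_000_000}.get(m.group(2), 1)
    return int(round(value * scale))
```

In the spider's parse method you would feed this the text extracted by the likes/followers XPath before storing it in the item.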

If there's a way to have access to the list of people who like or follow a page please let me know and give me an example.

If you have issues with the implementation just let me know what it is and I can help you with it.

Rahulsunny11 commented 4 years ago


Thanks for everything.

Rahulsunny11 commented 4 years ago


Also, can I crawl through multiple pages in one command?

ademjemaa commented 4 years ago

On the spiders I made you cannot, but you can make a spider that takes a list of pages and crawls them one by one. I do not advise you to execute multiple spiders at the same time, because you will easily get flagged by Facebook (you use a Facebook account to do the crawling, and if you make too many requests in a small interval of time you will get banned).

It's possible to make a spider that crawls multiple pages at the same time, though, reusing the spiders I have (since all the crawling and XPath handling is already there): you can change the crawling speed in the Scrapy settings file and find a speed that keeps Facebook from flagging you.

Or you could simply use a different Facebook account for each page. I am not sure how Facebook deals with and flags traffic from the same IP address, though, so I cannot tell you what to expect.
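The crawling-speed knobs mentioned above live in the project's Scrapy `settings.py`. One conservative configuration might look like this — the exact values are illustrative, not a guaranteed-safe threshold:

```python
# settings.py -- illustrative throttling values; tune them for your account.
# A long delay and a single concurrent request keep the request rate low,
# making it less likely the logged-in Facebook account gets flagged.
DOWNLOAD_DELAY = 5            # seconds to wait between requests
CONCURRENT_REQUESTS = 1       # one request in flight at a time
AUTOTHROTTLE_ENABLED = True   # let Scrapy back off when responses slow down
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 30
```

All of these are standard Scrapy settings; raising `DOWNLOAD_DELAY` trades speed for a lower chance of being rate-limited.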

Rahulsunny11 commented 4 years ago

OK, thank you. I will try making another spider which runs through multiple pages, and I will create a new account.

Rahulsunny11 commented 4 years ago


Hi again @ademjemaa,

```python
# navigate to provided page
href = response.urljoin(self.page)
self.logger.info('Scraping facebook page {}'.format(href))
return scrapy.Request(url=href, callback=self.parse_page, meta={'index': 1})
```

This is where the page link is given, right? Is it possible to change it so that it reads the links from .txt files? If so, can you please tell me what the code should be? Thanks.

ademjemaa commented 4 years ago

Hello, I cannot give you precise code because I don't have a PC these days, but I advise you to look at the way multiple post links in a page are handled a bit further down in the code.

What I would do is use a "for" loop, iterate through the page links one by one, and yield each link to the next function:

```python
for page in <insert list to iterate through here>:
    href = response.urljoin(page)
    self.logger.info('Scraping facebook page {}'.format(href))
    yield scrapy.Request(url=href, callback=self.parse_page, meta={'index': 1})
```

You can either import a function that loads a list of lines (in this case, links) from a file into a list, or write your own.

Obviously I gave you an extremely simplified version of what your code will end up looking like, but I believe it shouldn't be very hard for you to find a way both to import a list from a file and to iterate through it. However, keep in mind that with this method your spider will only move to the next page once it's fully done with the one it's currently scraping, and that might take some time; after all, it's just a basic loop. I don't know if it's possible to parse multiple pages at the same time while never getting flagged as a scraper by Facebook, but if the pages you are targeting are not hard to parse, it should be possible.
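For illustration, the "load a list of lines from a file" part is plain Python and independent of Scrapy. A minimal sketch — the helper name and the one-link-per-line file layout are my own choices, not part of fbcrawl:

```python
def load_pages(path):
    """Read page links from a text file, one non-empty line per link."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

In the spider, the loop from the snippet above would then iterate over something like `load_pages('pages.txt')` (filename illustrative) and yield one `scrapy.Request` per link.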

If you still have trouble finishing the spider, I can help you with the code in a week or two, when I am less busy.

Good luck and please let me know when you are done with the spider ;)

Rahulsunny11 commented 4 years ago

Thank you. Also, how does profile.py work in your repository? Should I give it a page link or a profile link?

Rahulsunny11 commented 4 years ago

Also, what is the reactions.py file used for? Is it for a page or a post? Because when I ran the command `scrapy crawl reactions -a email="" -a password="" -a post="/story.php?story_fbid=2715280352034215&id=1486073284954934" -a lang="it" -o babucomments.csv`, I got this error: `AttributeError: You need to provide a valid page name to crawl! scrapy fb -a page="PAGENAME"`