cbdelavenne / fb-messenger-media-scraper

Helper script to quickly and efficiently scrape your Facebook Messenger account for images shared in your conversations.
MIT License

No `[Messages][search_limit]` limit? #1

Open coreyzev opened 4 years ago

coreyzev commented 4 years ago

I want to use this on a conversation that spans over 4 years. It will probably download close to a thousand images, and there are likely tens of thousands of messages.

Rather than me guessing 100000000, would it make sense to just have a no-limit option?

Thanks!

cbdelavenne commented 4 years ago

For sure, it would definitely make sense! I haven't spent time improving the current version, but in theory I would need to make some of the functions recursive. If there's interest, I can look into implementing it.
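
For reference, the fetch doesn't strictly need recursion: fbchat's fetchThreadMessages accepts a before timestamp, so a loop can page backwards through a thread until it is exhausted. A minimal sketch, assuming an already-logged-in Client; the helper name, chunk size, and dedupe step are illustrative rather than part of the script:

def fetch_all_messages(fb_client, thread_id, chunk_size=100):
    # Page backwards from the newest message until the thread is exhausted.
    all_messages = []
    seen = set()
    before = None  # None starts from the newest message
    while True:
        chunk = fb_client.fetchThreadMessages(thread_id, limit=chunk_size, before=before)
        # Drop any boundary message that gets fetched twice across pages
        new = [m for m in chunk if m.uid not in seen]
        if not new:
            break
        all_messages.extend(new)
        seen.update(m.uid for m in new)
        before = int(new[-1].timestamp)  # oldest timestamp seen so far (ms)
    return all_messages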

mockshox commented 4 years ago

@cbdelavenne I see a great potential in this piece of code!

Frankly, I'm running into trouble trying to download all ~20k messages with

fb_client.fetchThreadMessages('1234567890', limit=20000)

This is probably an fbchat issue (e.g. poor support for GraphQL pagination), or Facebook itself refusing to share that much data at once.

Traceback (most recent call last):
  File "fbm-scraper.py", line 105, in <module>
    messages = fb_client.fetchThreadMessages('2895323113875834', limit=20000)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/fbchat/_client.py", line 783, in fetchThreadMessages
    j = self.graphql_request(_graphql.from_doc_id("1860982147341344", params))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/fbchat/_client.py", line 185, in graphql_request
    return self.graphql_requests(query)[0]
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/fbchat/_client.py", line 177, in graphql_requests
    return tuple(self._post("/api/graphqlbatch/", data, as_graphql=True))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/fbchat/_client.py", line 134, in _post
    content = check_request(r)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/fbchat/_util.py", line 156, in check_request
    check_http_code(r.status_code)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/fbchat/_util.py", line 171, in check_http_code
    raise FBchatFacebookError(msg, request_status_code=code)
fbchat._exception.FBchatFacebookError: Error when sending request: Got 500 response.

A workaround for this could be to scrape messages week by week or, even better, day by day.
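
A day-by-day scrape maps onto the same before parameter, which takes a millisecond timestamp (message timestamps are millisecond strings). A rough sketch, with the helper name, window bounds, and chunk size purely illustrative; keeping limit small should also make the 500s less likely:

import datetime

def fetch_window(fb_client, thread_id, start, end, chunk_size=100):
    # Fetch all messages between two datetimes (local time) by paging
    # backwards with `before`, which takes milliseconds.
    start_ms = int(start.timestamp() * 1000)
    before = int(end.timestamp() * 1000)
    messages = []
    while True:
        chunk = fb_client.fetchThreadMessages(thread_id, limit=chunk_size, before=before)
        in_window = [m for m in chunk if int(m.timestamp) >= start_ms]
        messages.extend(in_window)
        if not chunk or len(in_window) < len(chunk):
            break  # ran out of messages, or walked past the window start
        # Step past the oldest message seen; subtract 1 ms in case
        # `before` is inclusive, to avoid refetching the boundary message.
        before = int(chunk[-1].timestamp) - 1
    return messages

# Usage: one day at a time, e.g. all of 2020-05-01
# day = datetime.datetime(2020, 5, 1)
# msgs = fetch_window(fb_client, '1234567890', day, day + datetime.timedelta(days=1))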

cbdelavenne commented 4 years ago

Hey @mockshox, I'll try to look into this some time this week. It's definitely feasible to improve the search; it could even be rewritten to use recursion. I'll see what I can do. Feel free to contribute a solution if you'd like!

jakewilliami commented 4 years ago

Bump. Has this been updated at all? It looks like the current limit is 20:

Traceback (most recent call last):
  File "./fbm-scraper.py", line 92, in <module>
    threads = fb_client.fetchThreadList(limit=thread_search_limit)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fbchat/_client.py", line 826, in fetchThreadList
    raise FBchatUserError("`limit` should be between 1 and 20")
fbchat._exception.FBchatUserError: `limit` should be between 1 and 20

This partially defeats the purpose of automating it. Please let me know, and once again, great work on this @cbdelavenne :)

jakewilliami commented 4 years ago

It looks like the limit of 20 is a feature of the package used. An alternative (in pseudo-code because I don't know python very well) would be:

import math

CHUNK_SIZE = 20  # fbchat's hard cap on `limit` per request

if thread_search_limit >= CHUNK_SIZE:  # a true "no limit" mode would need the total count first
    number_of_iterations = math.ceil(thread_search_limit / CHUNK_SIZE)
    for i in range(1, number_of_iterations + 1):
        start = (i - 1) * CHUNK_SIZE + 1
        # The final chunk may be shorter than CHUNK_SIZE
        end = i * CHUNK_SIZE if i != number_of_iterations else thread_search_limit
        download_range = (start, end)
        # download files in download_range

Maybe from here you could play around with the API?
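
Along those lines, the thread list itself can be paged past the cap of 20 by keying each request off the oldest thread's last_message_timestamp (a millisecond string on fbchat Thread objects). A sketch under that assumption; the helper name and the minus-1-ms guard are illustrative:

def fetch_threads(fb_client, total, page_size=20):
    # Page the thread list past fbchat's cap of 20 per request.
    threads = []
    before = None
    while len(threads) < total:
        page = fb_client.fetchThreadList(limit=page_size, before=before)
        if not page:
            break
        threads.extend(page)
        # Key the next page off the oldest thread seen; subtract 1 ms
        # to avoid refetching the boundary thread.
        before = int(page[-1].last_message_timestamp) - 1
    return threads[:total]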

cbdelavenne commented 4 years ago

@jakewilliami Yeah, unfortunately I haven't had time to dedicate to improving this script, but seeing that there's some growing interest, I'll likely revisit it.

I'll give your suggestion a shot and see if it yields the desired result!