eddyharrington / WhatSoup

A web scraper that exports your entire WhatsApp chat history.
MIT License
Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"} #3

Open oddtazz opened 3 years ago

oddtazz commented 3 years ago
eddyharrington commented 3 years ago

@oddtazz This is a tricky one for me because I cannot reproduce your error with my computers, which means there is likely something different/unique about your WhatsApp and HTML that I have not encountered yet. I'm unable to provide any precise fixes for this unless I can see some of the HTML to investigate the cause.

If you are OK sharing your HTML with private info removed, run the whatsoup.py script again. When the error is thrown, don't close the script or your browser; instead open dev tools in Chrome (F12), run the JavaScript snippet below in the Console, verify there's no other private info in the output, and then send it to me here or privately/offline. I attached a photo of how it looks when run.

If you are not OK with this approach, an easy workaround is to replace line 212 of whatsoup.py with last_chat_msg = ''. The terminal window will then not show the last chat message when you're selecting a chat to scrape/export, but the actual scraping/exporting of your chat history is unaffected.

// This script sets all names/messages from chats in the left chat pane to 'redacted' instead of private info

// Get all of the viewable 'chat cards' from left pane. A chat card is what it sounds like, the rectangle with a person/groups photo and name, last message, and last message time. Note: these load dynamically based on your viewport, so it will only get the cards which can be seen in your browser window.
var chat_cards = document.querySelector('[aria-label*="Chat list"]').childNodes

// Loop through each of the viewable chat cards
for (let i=0; i<chat_cards.length; i++){
  // Get all descendants for a chat card
  let elems = chat_cards[i].querySelectorAll('*')

  // Loop through each descendant and search for the elements which hold private data, replacing it with 'redacted'
  for (let j=0; j<elems.length; j++){
    if (elems[j].getAttribute('title')){
      console.log("Replacing '%s' with 'redacted'", elems[j].getAttribute('title'));
      elems[j].setAttribute('title','redacted');
      elems[j].innerText = 'redacted';
    }
  }
}

// Print the redacted HTML so it can be safely shared
document.querySelector('[aria-label*="Chat list"]').outerHTML;

[Attached image: redacted-whatsapp]

oddtazz commented 3 years ago

Hey I have created a secret gist with the information you asked for https://gist.githubusercontent.com/oddtazz/10bb6e2111c392d960c53db6146efff4/raw/86f985f04b90c81f54e5ef24f92653b14dd260cf/redacted.html

eddyharrington commented 3 years ago

@oddtazz I'm not seeing anything out of the ordinary. Any chance you have chrome extensions installed that may be modifying the DOM?

Also, did you try the other workaround I noted above by setting the offending line to an empty string: last_chat_msg = ''? The only critical piece of data that's needed in the get_chats function is name_of_chat which you aren't having issues with, so setting the last_chat_msg variable to an empty string should fix the issue.

oddtazz commented 3 years ago

I have no Chrome extensions, and chromedriver launches a completely separate instance of Chrome, doesn't it? Just to be sure, I disabled all extensions and ran the script again, which gives me the same stack trace.

I also set last_chat_msg to an empty string but this too gives me the same error.

So here's what I am thinking: I have tried this script on 2 MacBook Pros (13" and 15") and a Windows machine. I get the same error on all of them, so it is probably not operating system or hardware related. I am using version 89.0.4389.82 of both chromedriver and the browser, which is the latest version at the moment. Is that the same version you are using? I can't think of any other issue which could cause this behavior.

eddyharrington commented 3 years ago

We are using the same version of Chrome. Thanks for checking on multiple machines and ruling out hardware.

I wonder if WhatsApp is changing some of its UI based on your locale? Can we compare localization?

  1. Check Chrome's language settings. If you go to chrome://settings/languages you can verify what language your browser is using. Mine is using English.
  2. Check HTTP Header language. You can check it on HTTP Bin under Accept-Language. Mine is "en-US,en;q=0.5"
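To compare two Accept-Language values side by side, a small parser can help. This is a minimal sketch, not part of WhatSoup; parse_accept_language is a hypothetical helper name, and it assumes the common "tag;q=weight" form with a default weight of 1.0 when q is omitted.

```python
def parse_accept_language(header):
    """Split an Accept-Language value into (tag, weight) pairs,
    ordered from most to least preferred."""
    prefs = []
    for part in header.split(','):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(';')
        weight = 1.0  # a tag without an explicit q-value has weight 1.0
        params = params.strip()
        if params.startswith('q='):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat an unparsable weight as least preferred
        prefs.append((tag.strip(), weight))
    # sorted() is stable, so tags with equal weight keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


if __name__ == '__main__':
    print(parse_accept_language('en-US,en;q=0.5'))
    print(parse_accept_language('en-UK,en;q=0.9,en-IN;q=0.8'))
```

Printing both parsed lists makes it easy to see which language tag each browser prefers first.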

Also, just to clarify: setting last_chat_msg to an empty string is still not working for you? If so, can you try replacing the entire get_chats function starting at line 152 with this and running the script again? Please share any traceback if it throws an error again.

def get_chats(driver):
    '''Traverses the WhatsApp chat-pane via keyboard input and collects chat information such as person/group name, last chat time and msg'''

    print("Loading your chats...", end="\r")

    # Wrap entire function in a retryable try/catch because chat-pane DOM changes frequently due to users typing, sending messages, and occasional WhatsApp notifications
    retry_attempts = 0
    while retry_attempts < 3:
        retry_attempts += 1

        # Try traversing the chat-pane
        try:
            # Find the chat search (xpath == 'Search or start new chat' element)
            chat_search = driver.find_element_by_xpath(
                '//*[@id="side"]/div[1]/div/label/div/div[2]')
            chat_search.click()

            # Count how many chat records there are below the search input by using keyboard navigation because HTML is dynamically changed depending on viewport and location in DOM
            selected_chat = driver.switch_to.active_element
            prev_chat_id = None
            is_last_chat = False
            chats = []

            # Descend through the chats
            while True:
                # Navigate to next chat
                selected_chat.send_keys(Keys.DOWN)

                # Set active element to new chat (without this we can't access the elements '.text' value used below for name/time/msg)
                selected_chat = driver.switch_to.active_element

                # Check if we are on the last chat by comparing current to previous chat
                if selected_chat.id == prev_chat_id:
                    is_last_chat = True
                else:
                    prev_chat_id = selected_chat.id

                # Gather chat info (chat name, chat time, and last chat message)
                if is_last_chat:
                    break
                else:
                    # Get the container of the contact card's title (xpath == parent div container to the span w/ title attribute set to chat name)
                    contact_title_container = selected_chat.find_element_by_xpath(
                        "./div/div[2]/div/div[1]")
                    # Then get all the spans it contains
                    contact_title_container_spans = contact_title_container.find_elements_by_tag_name(
                        'span')
                    # Then loop through all those until we find one w/ a title property (default to '' so name_of_chat is always defined)
                    name_of_chat = ''
                    for span_title in contact_title_container_spans:
                        if span_title.get_property('title'):
                            name_of_chat = span_title.get_property('title')
                            break

                    # Store chat info within a dict
                    chat = {"name": name_of_chat, "time": '', "message": ''}
                    chats.append(chat)

            # Navigate back to the top of the chat list
            chat_search.click()
            chat_search.send_keys(Keys.DOWN)

            print("Success! Your chats have been loaded.")
            break

        # Catch errors related to DOM changes
        except (StaleElementReferenceException, ElementNotInteractableException) as e:
            if retry_attempts == 3:
                # Make sure we grant user option to exit if DOM keeps changing while scanning chat list
                print("This is taking longer than usual...")
                while True:
                    response = input(
                        "Try loading chats again (y/n)? ")
                    if response.strip().lower() in {'n', 'no'}:
                        print(
                            'Error! Aborting chat load by user due to frequent DOM changes.')
                        # Re-raise the caught exception so its original message and traceback are preserved
                        raise
                    elif response.strip().lower() in {'y', 'yes'}:
                        retry_attempts = 0
                        break
                    else:
                        continue
            else:
                pass

    return chats

eddyharrington commented 3 years ago

@oddtazz Checking in to see if the above suggestion resolved the issue for you?

oddtazz commented 3 years ago

Hey, sorry for the late reply Eddy. I managed to get it to work after your previous comment. It turns out you were using en-US,en while I was using en-IN,en,en-UK, i.e. the English (India), English, and English (UK) languages.

My Accept-Language = "en-UK,en;q=0.9,en-IN;q=0.8"

The solution that worked for me was to make my Accept-Language look like yours. Thanks for making this software!

eddyharrington commented 3 years ago

@oddtazz Thanks for confirming! Quick question, did you solve this only by setting Accept-Language? For example did you also set WhatsApp language on your phone to English? I'm asking because I'm testing it with the en-UK locale you provided and still can't reproduce the error.

I added this line to setup_selenium():

# Set locale to @oddtazz's config
options.add_experimental_option('prefs', {'intl.accept_languages': 'en-UK,en;q=0.9,en-IN;q=0.8'})

HTTP Bin shows: "Accept-Language": "en-UK,en;q=0.9,en;q=0.9;q=0.8,en-IN;q=0.8;q=0.7" however I don't see any changes to WhatsApp UI and my script doesn't throw any errors.
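The doubled weights in that header (en;q=0.9;q=0.8) suggest that Chrome appends its own q-values to each entry of intl.accept_languages, so the preference probably expects a plain comma-separated list of language tags. Below is a minimal sketch under that assumption; to_chrome_pref is a hypothetical helper, not part of WhatSoup, and the commented-out line only shows where its result might go.

```python
def to_chrome_pref(accept_language):
    """Strip q-values from an Accept-Language value, keeping tag order.

    Assumption: Chrome derives the weights itself from the order of the
    tags, so the preference should contain the tags only.
    """
    tags = []
    for part in accept_language.split(','):
        tag = part.split(';', 1)[0].strip()
        if tag:
            tags.append(tag)
    return ','.join(tags)


if __name__ == '__main__':
    pref = to_chrome_pref('en-UK,en;q=0.9,en-IN;q=0.8')
    print(pref)
    # Possible use inside setup_selenium(), where `options` already exists:
    # options.add_experimental_option('prefs', {'intl.accept_languages': pref})
```

Checking HTTP Bin again after this change would show whether the header then matches the one oddtazz reported.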

oddtazz commented 3 years ago

I managed to break this again today with the accepted languages set to "Accept-Language": "en-US,en;q=0.9". I am sure I have made no changes to Chrome apart from fiddling with the language settings.

Maybe language is not the issue in this case, but that would lead to a bigger puzzle. What changed between 6 days ago and today, especially since the Chrome version is the same? I would say treat my issue as an edge case. Maybe I am doing something very different compared to others. (It would be helpful to know in what way though :p )

Regardless keep up the good work!

wandabwa2004 commented 3 years ago

@oddtazz, just check to make sure the chat it's breaking on is not blocked. Mine was breaking and it wasn't a language issue. It broke on a blocked chat.

z404 commented 3 years ago

Yep, I've got the same issue of this breaking on a contact that I've blocked.

camagenta commented 3 years ago

Same here. When I deleted the blocked contact, the process continued.

But I still get the issue, so I'm still looking for the other cause.
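If a blocked contact's chat card really lacks the element the script looks for, one option is to skip that card instead of aborting the whole scan. This is a minimal sketch under that assumption, not WhatSoup's actual code: collect_chats and read_name are hypothetical names, and the stand-in exception class only exists so the sketch runs without Selenium installed.

```python
try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:
    # Stand-in so this sketch runs without Selenium installed
    class NoSuchElementException(Exception):
        pass


def collect_chats(cards, read_name):
    """Return (chats, skipped).

    chats holds one dict per card whose name could be read; skipped holds
    the indexes of cards where read_name raised NoSuchElementException.
    """
    chats, skipped = [], []
    for index, card in enumerate(cards):
        try:
            name = read_name(card)
        except NoSuchElementException:
            # Assumed case: a blocked chat without the expected element
            skipped.append(index)
            continue
        chats.append({"name": name, "time": '', "message": ''})
    return chats, skipped
```

Reporting the skipped indexes would also make it easier to confirm whether blocked chats are the ones failing.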

amitvyas17 commented 2 years ago

My issue is different: the script runs perfectly, and at the end it asks me to select an export format (csv, txt, or html). When I select a format, it runs and asks whether I want to export another chat. After I answer no, it closes, but when I open the exported chat it only shows the header and has no content. Please help.

oddtazz commented 8 months ago

Closing this bug report as this project is not maintained anymore.