annidy / notes


selenium crawl #95

Open annidy opened 12 months ago

annidy commented 12 months ago
  1. Reuse the saved session the next time the browser opens
    
    from selenium import webdriver
    from selenium.webdriver import ChromeOptions

    # Point Chrome at an existing profile so cookies and logins persist across runs.
    options = ChromeOptions()
    options.add_argument(r"--user-data-dir=C:\Users\Admin\AppData\Local\Google\Chrome\User Data")
    driver = webdriver.Chrome(options=options)

    driver.get("xxx")

**WARNING**: if two processes use the same user data dir, the second one fails with "Message: session not created: Chrome failed to start: exited normally.\n  (session not created: DevToolsActivePort"
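
One way to avoid that collision is to give each process its own profile directory, which still keeps a saved session per process. The sketch below assumes a hypothetical base folder and worker id (neither appears in the notes); `profile_argument` and `make_driver` are illustrative names, not Selenium APIs.

```python
from pathlib import Path


def profile_argument(base_dir, worker_id):
    """Return a --user-data-dir flag pointing at a per-worker profile folder."""
    # Hypothetical layout: <base_dir>/profile-<worker_id>, created on first use.
    profile_dir = Path(base_dir) / f"profile-{worker_id}"
    profile_dir.mkdir(parents=True, exist_ok=True)
    return f"--user-data-dir={profile_dir}"


def make_driver(base_dir, worker_id):
    """Start Chrome with a profile that no other worker shares."""
    # Imported lazily so the path helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver import ChromeOptions

    options = ChromeOptions()
    options.add_argument(profile_argument(base_dir, worker_id))
    return webdriver.Chrome(options=options)
```

Each worker has to log in once in its own profile; after that its session is reused like in the single-process case.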

2. Find element

`e = driver.find_element(by=By.NAME, value="my-text")`

~value support regex. `elements = driver.find_elements(By.CLASS_NAME, re.compile("index_\w+"))`~

Find all elements and filter them in Python:

    import re

    regex_class_name = re.compile(r"index\w+")
    [e for e in driver.find_elements(By.XPATH, ".//*") if regex_class_name.match(str(e.get_attribute("class")))]
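
A `class` attribute can hold several space-separated names, and `re.match` only anchors at the start of the whole string, so the filter above would miss a value such as `btn index_main`. A minimal sketch that tests each class name separately follows; `filter_by_class` is a hypothetical helper that accepts anything with a `get_attribute` method, for example the list returned by `driver.find_elements(By.XPATH, ".//*")`.

```python
import re


def filter_by_class(elements, pattern):
    """Keep elements that have at least one class name fully matching pattern."""
    regex = re.compile(pattern)
    matched = []
    for element in elements:
        # get_attribute returns None when the attribute is absent.
        class_names = (element.get_attribute("class") or "").split()
        if any(regex.fullmatch(name) for name in class_names):
            matched.append(element)
    return matched
```

Every element costs one round trip to the browser here, so on large pages it is usually worth narrowing the candidates with a locator first.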

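Or use XPath string functions for simple pattern matching (these are not regular expressions):

    "//input[contains(@name,'sel')]"
    "//input[starts-with(@name,'Tut')]"
    "//input[ends-with(@name,'nium')]"

Note that `ends-with` is an XPath 2.0 function. Browsers implement XPath 1.0, so the last expression does not work in Chrome.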


`By.XPATH` is the most powerful and the most complex locator. See [XPath syntax](https://www.w3school.com.cn/xpath/xpath_syntax.asp) and the [XPath cheatsheet](https://devhints.io/xpath).

You can test an expression in the Chrome DevTools console with `$x(path[, startNode])`.
[Console Utilities API reference](https://developer.chrome.com/docs/devtools/console/utilities/#xpath-function)
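
Browsers evaluate XPath 1.0, which has `contains` and `starts-with` but no `ends-with`. A suffix test can be written with `substring` and `string-length` instead. The helper below is a small sketch of that idiom; `xpath_ends_with` is a hypothetical name, and it deliberately rejects suffixes containing a single quote rather than handling the quoting.

```python
def xpath_ends_with(tag, attribute, suffix):
    """Build an XPath 1.0 expression matching attribute values that end with suffix."""
    if "'" in suffix:
        # Kept simple: a literal containing ' would need concat(), not handled here.
        raise ValueError("suffix must not contain a single quote")
    attr = f"@{attribute}"
    # substring() is 1-indexed, so start at length(attr) - length(suffix) + 1.
    start = f"string-length({attr}) - string-length('{suffix}') + 1"
    return f"//{tag}[substring({attr}, {start}) = '{suffix}']"
```

The result can be pasted into `$x(...)` in the console to check it, then passed to `driver.find_elements(By.XPATH, xpath_ends_with("input", "name", "nium"))`.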

`find_element` can also search a [subset](https://www.selenium.dev/zh-cn/documentation/webdriver/elements/finders/#evaluating-a-subset-of-the-dom) of the DOM when it is called on an element:
`driver.find_element(by=By.NAME, value="my-text").find_element(...)`

3. Get element information
https://www.selenium.dev/zh-cn/documentation/webdriver/elements/information/
`is_displayed()`, `is_enabled()`, frame, css, title...
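
As a quick illustration of those calls, the sketch below gathers a few properties of one element into a dict. `describe` is a hypothetical helper; the attribute name `"name"` and the CSS property `"color"` are arbitrary choices for the example.

```python
def describe(element):
    """Collect commonly inspected properties of a WebElement into a dict."""
    return {
        "tag": element.tag_name,
        "text": element.text,
        "displayed": element.is_displayed(),
        "enabled": element.is_enabled(),
        "selected": element.is_selected(),
        # Computed style value as reported by the browser.
        "color": element.value_of_css_property("color"),
        "name": element.get_attribute("name"),
    }
```

Typical use would be `describe(driver.find_element(by=By.NAME, value="my-text"))`.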

4. Interaction
Use [element](https://www.selenium.dev/zh-cn/documentation/webdriver/elements/interactions/) methods such as `click()`,
or use the [bidirectional](https://www.selenium.dev/zh-cn/documentation/webdriver/bidirectional/) APIs. JavaScript can also be run directly in the page:

    driver.execute_script(
        '''
        var kw = document.getElementById('kw');
        var su = document.getElementById('su');
        kw.value = 'Selenium';
        su.click();
        '''
    )
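
Values do not have to be hard-coded into the script: extra positional arguments to `execute_script` are available inside the page as `arguments[0]`, `arguments[1]`, and so on. A minimal sketch, reusing the `kw` and `su` element ids from the snippet above; `SEARCH_SCRIPT` and `search` are hypothetical names.

```python
SEARCH_SCRIPT = """
var kw = document.getElementById('kw');
var su = document.getElementById('su');
kw.value = arguments[0];
su.click();
"""


def search(driver, text):
    """Type text into the #kw box via JavaScript and click #su."""
    # Extra positional arguments show up in the page as the arguments[] array.
    return driver.execute_script(SEARCH_SCRIPT, text)
```

Passing the text as an argument avoids building JavaScript by string concatenation, so quotes in the text cannot break the script.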



5. Wait
[Waiting for the page to finish loading (Waits)](https://selenium-python-zh.readthedocs.io/en/latest/waits.html)
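
An explicit wait boils down to polling a condition until it succeeds or a timeout expires. Selenium ships this as `WebDriverWait(driver, timeout).until(condition)`; the plain-Python sketch below only illustrates the idea and is not Selenium's implementation. `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5, ignored=(Exception,)):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            # Treat "not there yet" errors as a failed attempt and retry.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

For example, `wait_until(lambda: driver.find_element(By.NAME, "my-text"))` keeps retrying while the lookup raises. In real code, narrowing `ignored` to the exceptions you expect is safer than the broad default used here.
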

annidy commented 11 months ago

Network capture

  1. Use bidi_connection. The Java example below uses the DevTools API:
    
    import java.util.Optional;

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.devtools.DevTools;
    import org.openqa.selenium.devtools.v108.network.Network;

    public class CodekruTest {

        public static void main(String[] args) {

            // pass the path of the chromedriver location in the second argument
            System.setProperty("webdriver.chrome.driver", "E:\\chromedriver.exe");
            WebDriver driver = new ChromeDriver();

            // open a DevTools session and turn on network event reporting
            DevTools devTools = ((ChromeDriver) driver).getDevTools();
            devTools.createSession();
            devTools.send(Network.enable(Optional.of(1000000), Optional.empty(), Optional.empty()));

            // print every outgoing request; getPostData() returns an Optional
            devTools.addListener(Network.requestWillBeSent(), request -> {
                System.out.println("Request Method : " + request.getRequest().getMethod());
                System.out.println("Request URL : " + request.getRequest().getUrl());
                System.out.println("Request headers: " + request.getRequest().getHeaders().toString());
                System.out.println("Request body: " + request.getRequest().getPostData().toString());
            });

            driver.get("https://www.makemytrip.com/");
        }
    }

https://www.codekru.com/selenium/how-to-get-network-call-requests-in-selenium
https://stackoverflow.com/questions/72912626/selenium-4-chrome-devtools-using-python-fetch-fail-request-not-failing-the-reque

2. Use logs

    import json
    from selenium import webdriver
    from selenium.webdriver import ChromeOptions

    # Selenium 4.10 removed the desired_capabilities argument, so the
    # logging preference is set on the options object instead.
    options = ChromeOptions()
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
    driver = webdriver.Chrome(options=options)
    driver.get('https://stackoverflow.com/questions/52633697/selenium-python-how-to-capture-network-traffics-response')

    def process_browser_log_entry(entry):
        # each log entry wraps the DevTools event as a JSON string
        response = json.loads(entry['message'])['message']
        return response

    browser_log = driver.get_log('performance')
    events = [process_browser_log_entry(entry) for entry in browser_log]
    events = [event for event in events if 'Network.response' in event['method']]



https://stackoverflow.com/questions/52633697/selenium-python-how-to-capture-network-traffics-response
https://gist.github.com/lorey/079c5e178c9c9d3c30ad87df7f70491d
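
Once the performance log has been fetched, the entries are plain dicts whose `message` field is a JSON string. The sketch below pulls the URL and status code out of `Network.responseReceived` events; `response_summaries` is a hypothetical helper, and it reads only the `params.response.url` and `params.response.status` fields of the DevTools event.

```python
import json


def response_summaries(log_entries):
    """Return (url, status) pairs for Network.responseReceived events."""
    summaries = []
    for entry in log_entries:
        # Each entry's "message" field is a JSON string wrapping the DevTools event.
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        response = event["params"]["response"]
        summaries.append((response["url"], response["status"]))
    return summaries
```

It would be called as `response_summaries(driver.get_log('performance'))`. Response bodies are not part of these log entries.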

3. Use seleniumwire
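
Selenium Wire wraps the regular driver and records traffic in `driver.requests`. The following is a minimal sketch, assuming the third-party `selenium-wire` package is installed; `summarize` and `capture` are hypothetical helper names, and the URL passed to `capture` is up to the caller.

```python
def summarize(requests):
    """Return (method, url, status_code) for every request that got a response."""
    return [
        (request.method, request.url, request.response.status_code)
        for request in requests
        if request.response is not None
    ]


def capture(url):
    """Load url with Selenium Wire and summarize the captured traffic."""
    # Imported lazily; requires the third-party selenium-wire package.
    from seleniumwire import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return summarize(driver.requests)
    finally:
        driver.quit()
```

Requests that never received a response have `response` set to `None`, which is why `summarize` skips them.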