User script portability regarding drivers like `*CrawlDriver` classes

gildas-lormeau commented 1 week ago

Overall, I find the idea interesting! For my part, I think I could implement a CDPCrawlDriver class (using the Chrome DevTools Protocol under the hood) in single-file-cli. Now, let's imagine a userscript written by a user for ArchiveBox that depends on PuppeteerCrawlDriver. Assuming the APIs of the two CrawlDriver classes are identical, if they wanted to run it in single-file-cli, would they be responsible for replacing the "puppeteer" occurrences with "cdp" in the userscript?

gildas-lormeau commented 1 week ago

Looking back, I realize that the userscript couldn't be ported that easily because the underlying APIs are actually totally different. As a result, the user would have to rewrite the code against the CDP API.

EDIT: Maybe it would be a SingleFileCrawlDriver instead, and would not be intended to replace PuppeteerCrawlDriver (i.e. the API would not be the same)?

pirate commented 1 week ago

Yeah maybe recommending CDP before puppeteer/playwright is a good idea, for exactly the reason you're saying.

I think the order in which plugins should implement scripts is something like:

1. window: almost every tool provides this, so everything should start here
2. CDP: this is more general than puppeteer/playwright and should probably come next, since you can directly access the CDP session from puppeteer or playwright and thus use CDP scripts in either
3. Puppeteer/playwright-specific code: if the maintainer is more familiar with these than with using CDP directly, they can write these, knowing that their scripts won't work in every tool (e.g. singlefile, or ArchiveBox if it provides playwright but not puppeteer, etc.)
4. Other contexts

If I understand correctly, CDP is an event-driven API anyway, so it may be easy to expose a common interface for plugins to send CDP events even if they're using puppeteer/playwright.
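The point about reaching the CDP session from either library can be sketched as follows. This is a minimal illustration, not part of the proposed spec: `getCDPSession` and `collectRequestUrls` are hypothetical helper names. It relies on Puppeteer's `page.createCDPSession()` and Playwright's `browserContext.newCDPSession(page)` (Chromium only), both of which return an object with `send(method, params)` and `on(eventName, handler)`.

```javascript
// Hypothetical helper: obtain a CDP session from whichever driver created `page`.
async function getCDPSession(page) {
    // Puppeteer exposes createCDPSession() directly on the page
    if (typeof page.createCDPSession === 'function') {
        return page.createCDPSession();
    }
    // Playwright (Chromium only) exposes newCDPSession(page) on the browser context
    if (typeof page.context === 'function' && typeof page.context().newCDPSession === 'function') {
        return page.context().newCDPSession(page);
    }
    throw new Error('This driver does not expose a CDP session');
}

// A CDP-level script written once against the session object,
// so it does not care which library opened the browser.
async function collectRequestUrls(session, onUrl) {
    session.on('Network.requestWillBeSent', ({ request }) => onUrl(request.url));
    await session.send('Network.enable');
}
```

A script written this way only depends on raw CDP method and event names, which is what would make it portable to a driver that speaks CDP directly.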

pirate commented 1 week ago

Can you provide an example of how you're using the CDP APIs currently? I dug through the single-file codebase a little but I didn't find any obvious CDP or chrome.debugger calls (sorry I'm not super familiar with raw CDP), I mostly saw browser. calls from background.js.

I can help show how the CDP stuff could be written as a cdp hook in a Behavior.

gildas-lormeau commented 1 week ago

You can find the code using the CDP API here: https://github.com/gildas-lormeau/single-file-cli/blob/a5dc004949b4a8b5180ffb53461a6305b6b4d07a/lib/cdp-client.js (you were searching in the wrong repository).

I have a more general question: single-file-cli is capable of crawling sites. Because of this, I don't know if I should read your spec proposal as a spec implementer, or consider single-file-cli as just a Driver class in other crawlers (e.g. ArchiveBox), or both?

pirate commented 1 week ago

Ok so for simplecdp a behavior might look like this:

const AdDetectorBehavior = {
    name: 'AdDetectorBehavior',
    schema: 'BehaviorSchema@0.1.0',
    version: '0.1.0',

    // known ad network domains/patterns
    AD_PATTERNS: [
        'doubleclick.net',
        'googlesyndication.com',
        'adnxs.com',
        '/ads/',
        '/adserve/',
        'analytics',
        'tracker',
    ],

    hooks: {
        simplecdp: {
            PAGE_SETUP: async (event, BehaviorBus, cdp) => {
                await cdp.Network.enable();

                await cdp.Network.setRequestInterception({ patterns: [{ urlPattern: '*' }] });

                cdp.Network.requestIntercepted(async ({ interceptionId, request }) => {
                    const isAd = AdDetectorBehavior.AD_PATTERNS.some(pattern => request.url.includes(pattern));

                    if (isAd) {
                        BehaviorBus.emit({
                            type: 'DETECTED_AD',
                            url: request.url,
                            timestamp: Date.now(),
                            requestData: {
                                method: request.method,
                                headers: request.headers,
                            },
                        });

                        // either block the request or let it continue
                        await cdp.Network.continueInterceptedRequest({
                            interceptionId,
                            errorReason: 'BlockedByClient'  // a valid CDP Network.ErrorReason; remove this to let ads load
                        });
                    } else {
                        await cdp.Network.continueInterceptedRequest({ interceptionId });
                    }
                });
            },
        }
    }
};

export default AdDetectorBehavior;

> I have a more general question, single-file-cli is capable of crawling sites. Because of this, I don't know if I should read your spec proposal as a spec implementer or consider single-file-cli

So to use behaviors you'd add something like this to your existing single-file-cli crawling setup code:

async function getPageData(options) {
        ...
        const cdp = new CDP(targetInfo);
        const { Browser, Security, Page, Emulation, Fetch, Network, Runtime, Debugger, Console } = cdp;
        ...

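        // class name assumed from the bus path checked below; `new BehaviorBus()` here would
        // reference the local const before initialization and throw a ReferenceError
        const BehaviorBus = new SimpleCDPBehaviorBus();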
        BehaviorBus.attachContext(cdp);
        BehaviorBus.attachBehaviors([AdDetectorBehavior]);

        await Page.addScriptToEvaluateOnNewDocument({
            source: `
                window.BEHAVIORS = [${JSON.stringify(AdDetectorBehavior)}];
                ${fs.readFileSync('behaviors.js')};
                window.BehaviorBus.addEventListener('*', (event) => {
                    if (!event.detail.metadata.path.includes('SimpleCDPBehaviorBus')) {
                        dispatchEventToCDPBehaviorBus(JSON.stringify(event.detail));
                    }
                });
            `,
            runImmediately: true,
        });

        // set up forwarding from WindowBehaviorBus -> SimpleCDPBehaviorBus
        await Runtime.addBinding({name: 'dispatchEventToCDPBehaviorBus'});
        Runtime.bindingCalled(({name, payload}) => {
            if (name === 'dispatchEventToCDPBehaviorBus') {
                BehaviorBus.dispatchEvent(JSON.parse(payload));
            }
        });

        // set up forwarding from SimpleCDPBehaviorBus -> WindowBehaviorBus 
        BehaviorBus.addEventListener('*', (event) => {
            event = new BehaviorEvent(event);
            if (!event.detail.metadata.path.includes('WindowBehaviorBus')) {
                cdp.Runtime.evaluate({
                    expression: `
                        const event = new BehaviorEvent(${JSON.stringify(event.detail)});
                        window.BehaviorBus.dispatchEvent(event);
                    `
                });
            }
        });

       ...
       BehaviorBus.emit({type: 'PAGE_SETUP', url})

       // start loading the URL to capture
       const [contextId] = await Promise.all([
            loadPage({ Page, Runtime }, options, debugMessages),
            options.browserDebug ? waitForDebuggerReady({ Debugger }) : Promise.resolve()
       ]);

       BehaviorBus.emit({type: 'PAGE_LOAD', url})

       ...
       BehaviorBus.emit({type: 'PAGE_CAPTURE', url})
       ...
}
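To make the forwarding logic above easier to follow, here is a minimal sketch of what such a bus might do internally. It is an assumption for illustration, not the actual abx-spec-behaviors implementation: `SketchBehaviorBus` and `link` are hypothetical names, and events are plain objects rather than real `BehaviorEvent` instances. The idea it shows is the one used in the snippet above: each bus appends its own name to `metadata.path`, and a forwarder skips any event whose path already contains the destination bus, so events are not echoed back and forth.

```javascript
// Hypothetical minimal bus; the real implementation may differ.
class SketchBehaviorBus {
    constructor(name, hookKey) {
        this.name = name;        // e.g. 'SimpleCDPBehaviorBus'
        this.hookKey = hookKey;  // which hooks.<key> section this bus runs, e.g. 'simplecdp'
        this.context = null;
        this.behaviors = [];
        this.listeners = [];
    }

    attachContext(context) { this.context = context; }
    attachBehaviors(behaviors) { this.behaviors.push(...behaviors); }
    addEventListener(type, listener) { this.listeners.push({ type, listener }); }

    // Create a brand new event on this bus
    emit(detail) {
        return this.dispatchEvent({ ...detail, metadata: { path: [] } });
    }

    // Deliver an event, which may have been forwarded from another bus
    async dispatchEvent(detail) {
        const path = [...(detail.metadata?.path ?? []), this.name];
        const event = { detail: { ...detail, metadata: { ...detail.metadata, path } } };
        // run matching behavior hooks with (event, bus, context)
        for (const behavior of this.behaviors) {
            const hook = behavior.hooks?.[this.hookKey]?.[detail.type];
            if (hook) await hook(event, this, this.context);
        }
        // then notify plain listeners ('*' matches every event type)
        for (const { type, listener } of this.listeners) {
            if (type === '*' || type === detail.type) await listener(event);
        }
    }
}

// Forward every event from one bus to another, unless it already passed through the target
function link(from, to) {
    from.addEventListener('*', (event) => {
        if (!event.detail.metadata.path.includes(to.name)) {
            return to.dispatchEvent(event.detail);
        }
    });
}
```

In the single-file-cli snippet the two directions of `link` are implemented over `Runtime.addBinding` and `Runtime.evaluate` because the two buses live in different processes; the loop-prevention check is the same.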

gildas-lormeau commented 1 week ago

Thanks for the info! I haven't tested the code but I understand the principle and it sounds good to me. This pattern would probably help to better organize the code in cdp-client.js.

pirate commented 1 week ago

Ok cool, don't do any big changes to your code just yet! I'm still discussing the design with webrecorder / not convinced it's good enough yet.

I'll keep you posted! Let me know if you have any ideas on other approaches or how to improve it.

pirate commented 6 days ago

What are your thoughts on https://w3c.github.io/webdriver-bidi/ ? It seems like CDP is slowly going away in favor of it, so I'm considering removing the playwright/puppeteer/cdp contexts from the spec in favor of forcing bidi to be the common spec for browser-layer commands. Unfortunately it's not as clean as your nice proxy model solution and there are a lot of common utilities missing (e.g. waitForSelector(...)), but it might be the only way to have a unified format across all browsers/tools?

Scripts would look something like this:

// Using raw WebSocket from browser or Node for BiDi connection
import WebSocket from 'ws';

// this would be built into the spec / utility library
class WebDriverBiDi {
    constructor(websocketUrl) {
        this.ws = new WebSocket(websocketUrl);
        this.messageId = 0;
        this.subscribers = new Map();

        this.ws.on('message', (data) => {
            const message = JSON.parse(data);
            if (message.id) {
                const subscriber = this.subscribers.get(message.id);
                if (subscriber) {
                    subscriber(message);
                    this.subscribers.delete(message.id);
                }
            }
        });
    }

    async send(method, params = {}) {
        const id = ++this.messageId;
        const message = {
            id,
            method,
            params
        };

        return new Promise((resolve) => {
            this.subscribers.set(id, resolve);
            this.ws.send(JSON.stringify(message));
        });
    }
}

async function example() {
    // Connect to Chrome's BiDi endpoint
    // Chrome should be started with: --enable-bidi-protocol
    const bidi = new WebDriverBiDi('ws://localhost:9222/session');

    // Create a new context (tab)
    const { result: { context: contextId } } = await bidi.send('browsingContext.create', {
        type: 'tab'
    });

    // the code below here is what would be implemented inside a behavior...

    // Set up network interception
    await bidi.send('network.addIntercept', {
        phases: ['beforeRequestSent'],
        patterns: [{ urlPattern: '*example.com*' }]
    });

    await bidi.send('network.onIntercept', {
        callback: (params) => {
            if (params.phase === 'beforeRequestSent' && params.request.url.includes('example.com')) {
                return {
                    action: 'block'
                };
            }
            return { action: 'continue' };
        }
    });

    // Navigate to a URL
    await bidi.send('browsingContext.navigate', {
        context: contextId,
        url: 'https://google.com'
    });

    // Wait for element to appear
    const script = `
        new Promise((resolve) => {
            const checkElement = () => {
                const element = document.querySelector('input[name="q"]');
                if (element) {
                    resolve(true);
                } else {
                    requestAnimationFrame(checkElement);
                }
            };
            checkElement();
        });
    `;

    await bidi.send('script.evaluate', {
        target: { context: contextId },  // BiDi addresses the realm via `target`
        expression: script,
        awaitPromise: true
    });

    console.log('Search input found!');
}

// Run the example
example().catch(console.error);
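One caveat on the sketch above: in the published WebDriver BiDi spec there is no callback-style `network.onIntercept` command. Intercepted requests arrive as `network.beforeRequestSent` events (after `session.subscribe`), and the client answers each one with `network.continueRequest` or `network.failRequest`. A client therefore also needs to route events, not just command responses. The following is a minimal sketch under stated assumptions: `BiDiSketchClient` and `blockMatchingRequests` are hypothetical names, and `transport` is a stand-in object with `send(string)` and an `onmessage` callback that would wrap a WebSocket in practice.

```javascript
// Sketch only: `transport` is a hypothetical wrapper around a WebSocket.
class BiDiSketchClient {
    constructor(transport) {
        this.transport = transport;
        this.nextId = 0;
        this.pending = new Map();    // command id -> { resolve, reject }
        this.listeners = new Map();  // event name -> [handler]
        transport.onmessage = (data) => this.handleMessage(JSON.parse(data));
    }

    send(method, params = {}) {
        const id = ++this.nextId;
        return new Promise((resolve, reject) => {
            this.pending.set(id, { resolve, reject });
            this.transport.send(JSON.stringify({ id, method, params }));
        });
    }

    on(eventName, handler) {
        const handlers = this.listeners.get(eventName) ?? [];
        handlers.push(handler);
        this.listeners.set(eventName, handlers);
    }

    // BiDi messages carry type: 'success' | 'error' | 'event'
    handleMessage(message) {
        if (message.type === 'event') {
            for (const handler of this.listeners.get(message.method) ?? []) handler(message.params);
            return;
        }
        const waiter = this.pending.get(message.id);
        if (!waiter) return;
        this.pending.delete(message.id);
        if (message.type === 'error') waiter.reject(new Error(`${message.error}: ${message.message}`));
        else waiter.resolve(message.result);
    }
}

// Intercept every request and fail the ones for which shouldBlock(url) is true
async function blockMatchingRequests(bidi, shouldBlock) {
    bidi.on('network.beforeRequestSent', (params) => {
        if (!params.isBlocked) return;           // only intercepted requests need a decision
        const request = params.request.request;  // the request id
        const method = shouldBlock(params.request.url) ? 'network.failRequest' : 'network.continueRequest';
        bidi.send(method, { request }).catch(() => {});  // errors ignored in this sketch
    });
    await bidi.send('session.subscribe', { events: ['network.beforeRequestSent'] });
    await bidi.send('network.addIntercept', { phases: ['beforeRequestSent'] });
}
```

Inside a behavior, the decision function is the only part that would change, for example `blockMatchingRequests(bidi, (url) => url.includes('example.com'))`.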

A few potential benefits:

- bidi is standardized across all browsers
- existing puppeteer/playwright/cdp code can be easily translated to BiDi with claude because it's very well specified
- bidi is low-level enough that there's very little it can't do
- bidi websocket commands could be filtered by the driver to implement some "permissions" / limits on what random behaviors can do
- the BehaviorBus event forwarding and support for multiple contexts could be removed entirely in favor of just providing a bidi websocket directly in the page context (e.g. window.BIDI) and having all behaviors run in the context of window. If they need a CDP/bidi command they can just call await window.BIDI.send(...) from inside the page.
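The "permissions" idea in the list above could be as small as a wrapper around `send`. This is a hypothetical sketch, not something the spec defines: `restrictCommands` is an invented name, and the allowlist format (a module name such as `script`, or a full command name such as `network.addIntercept`) is an assumption.

```javascript
// Hypothetical permission filter: wraps any client that has send(method, params)
// and rejects commands that are not on the allowlist.
function restrictCommands(client, allowed) {
    return {
        send(method, params = {}) {
            // an entry matches either a whole module ('script') or one exact command
            const permitted = allowed.some(
                (entry) => method === entry || method.startsWith(entry + '.')
            );
            if (!permitted) {
                return Promise.reject(new Error(`Command not permitted: ${method}`));
            }
            return client.send(method, params);
        },
    };
}
```

A driver could then hand each behavior `restrictCommands(bidi, ['script', 'network.addIntercept'])` instead of the raw connection, so an untrusted behavior cannot, for example, navigate the tab or close the browser.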

gildas-lormeau commented 4 days ago

I think the WebDriver BiDi standard is a very good initiative. I'd had a look at it but hadn't noticed the existence of the script.addPreloadScript command; the lack of such a command is what blocked me in the past with WebDriver. I'll have to do some testing, but I'm interested in replacing the CDP client with a BiDi client. My basic need was to be able to provide executables that weren't too heavy, which is why I went down this road.

In the short term, I think I'll try to implement a library based on the Proxy API.

ArchiveBox / abx-spec-behaviors

User script portability regarding drivers like `*CrawlDriver` classes #1