Open gildas-lormeau opened 1 week ago
Looking back, I realize that the userscript couldn't be ported as easily because the underlying APIs are actually totally different. As a result, the user would have to rewrite the code depending on the CDP API.
EDIT: Maybe it would be a SingleFileCrawlDriver instead, and would not be intended to replace PuppeteerCrawlDriver (i.e. the API would not be the same)?
Yeah maybe recommending CDP before puppeteer/playwright is a good idea, for exactly the reason you're saying.
I think the order in which plugins should implement scripts is something like:
1. window: almost every tool provides this, so everything should start here
2. CDP: this is more general than puppeteer/playwright, and is probably what should come next, as you can directly access the CDP session from puppeteer or playwright and thus can use CDP scripts in either

If I understand correctly, CDP is an event-driven API anyway, so it may be easy to expose a common interface for plugins to send CDP events even if they're using puppeteer/playwright.
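A minimal sketch of what such a common interface could look like, assuming the driver hands over a Puppeteer page (which has createCDPSession()) or a Playwright Chromium page (whose context has newCDPSession(page)). The helper names openCdpSession and disableCache are made up for illustration and are not part of the proposed spec.

```javascript
// Sketch only: `openCdpSession` is a hypothetical helper, not part of any spec.
// It returns a raw CDP session from either a Puppeteer or a Playwright page.
async function openCdpSession(page) {
  // Puppeteer exposes createCDPSession() directly on the page.
  if (typeof page.createCDPSession === 'function') {
    return page.createCDPSession();
  }
  // Playwright (Chromium only) exposes newCDPSession(page) on the browser context.
  if (typeof page.context === 'function') {
    const context = page.context();
    if (context && typeof context.newCDPSession === 'function') {
      return context.newCDPSession(page);
    }
  }
  throw new Error('This driver does not expose a CDP session');
}

// A CDP-level hook then only depends on session.send(), whatever the driver is.
// `disableCache` is a hypothetical example hook.
async function disableCache(session) {
  await session.send('Network.enable');
  await session.send('Network.setCacheDisabled', { cacheDisabled: true });
}
```

With this shape, a behavior written against raw CDP would not need to know which of the two libraries launched the browser.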
Can you provide an example of how you're using the CDP APIs currently? I dug through the single-file codebase a little but I didn't find any obvious CDP or chrome.debugger calls (sorry I'm not super familiar with raw CDP), I mostly saw browser. calls from background.js.
I can help show how the CDP stuff could be written as a cdp hook in a Behavior.
You can find the code using the CDP API here: https://github.com/gildas-lormeau/single-file-cli/blob/a5dc004949b4a8b5180ffb53461a6305b6b4d07a/lib/cdp-client.js (you were searching in the wrong repository).
I have a more general question: single-file-cli is capable of crawling sites. Because of this, I don't know if I should read your spec proposal as a spec implementer, or consider single-file-cli as just a Driver class in other crawlers, e.g. ArchiveBox, or both?
Ok so for simplecdp a behavior might look like this:
const AdDetectorBehavior = {
  name: 'AdDetectorBehavior',
  schema: 'BehaviorSchema@0.1.0',
  version: '0.1.0',

  // known ad network domains/patterns
  AD_PATTERNS: [
    'doubleclick.net',
    'googlesyndication.com',
    'adnxs.com',
    '/ads/',
    '/adserve/',
    'analytics',
    'tracker',
  ],

  hooks: {
    simplecdp: {
      PAGE_SETUP: async (event, BehaviorBus, cdp) => {
        await cdp.Network.enable();
        // pause every request so it can be inspected before it is sent
        await cdp.Network.setRequestInterception({ patterns: [{ urlPattern: '*' }] });

        cdp.Network.requestIntercepted(async ({ interceptionId, request }) => {
          const isAd = AdDetectorBehavior.AD_PATTERNS.some(pattern => request.url.includes(pattern));
          if (isAd) {
            BehaviorBus.emit({
              type: 'DETECTED_AD',
              url: request.url,
              timestamp: Date.now(),
              requestData: {
                method: request.method,
                headers: request.headers,
              },
            });
            // either block the request or let it continue
            await cdp.Network.continueInterceptedRequest({
              interceptionId,
              errorReason: 'BlockedByClient', // a CDP Network.ErrorReason value; remove this to let ads load
            });
          } else {
            await cdp.Network.continueInterceptedRequest({ interceptionId });
          }
        });
      },
    },
  },
};

export default AdDetectorBehavior;
So to use behaviors you'd add something like this to your existing single-file-cli crawling setup code:
async function getPageData(options) {
  ...
  const cdp = new CDP(targetInfo);
  const { Browser, Security, Page, Emulation, Fetch, Network, Runtime, Debugger, Console } = cdp;
  ...
  // named behaviorBus so the instance does not shadow the BehaviorBus class
  const behaviorBus = new BehaviorBus();
  behaviorBus.attachContext(cdp);
  behaviorBus.attachBehaviors([AdDetectorBehavior]);

  // note: JSON.stringify drops functions, so the hooks themselves are not serialized here
  await Page.addScriptToEvaluateOnNewDocument({
    source: `
      window.BEHAVIORS = [${JSON.stringify(AdDetectorBehavior)}];
      ${fs.readFileSync('behaviors.js', 'utf8')};
      window.BehaviorBus.addEventListener('*', (event) => {
        if (!event.detail.metadata.path.includes('SimpleCDPBehaviorBus')) {
          dispatchEventToCDPBehaviorBus(JSON.stringify(event.detail));
        }
      });
    `,
    runImmediately: true,
  });

  // set up forwarding from WindowBehaviorBus -> SimpleCDPBehaviorBus
  await Runtime.addBinding({ name: 'dispatchEventToCDPBehaviorBus' });
  Runtime.bindingCalled(({ name, payload }) => {
    if (name === 'dispatchEventToCDPBehaviorBus') {
      behaviorBus.dispatchEvent(JSON.parse(payload));
    }
  });

  // set up forwarding from SimpleCDPBehaviorBus -> WindowBehaviorBus
  behaviorBus.addEventListener('*', (event) => {
    event = new BehaviorEvent(event);
    if (!event.detail.metadata.path.includes('WindowBehaviorBus')) {
      // wrapped in a block so the const can be declared again on the next evaluate call
      cdp.Runtime.evaluate({
        expression: `{
          const event = new BehaviorEvent(${JSON.stringify(event.detail)});
          window.BehaviorBus.dispatchEvent(event);
        }`
      });
    }
  });
  ...
  behaviorBus.emit({ type: 'PAGE_SETUP', url });

  // start loading the URL to capture
  const [contextId] = await Promise.all([
    loadPage({ Page, Runtime }, options, debugMessages),
    options.browserDebug ? waitForDebuggerReady({ Debugger }) : Promise.resolve()
  ]);
  behaviorBus.emit({ type: 'PAGE_LOAD', url });
  ...
  behaviorBus.emit({ type: 'PAGE_CAPTURE', url });
  ...
}
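The two forwarding handlers above rely on metadata.path to avoid bouncing an event back and forth forever. A minimal sketch of that idea in isolation, assuming each bus appends its own name to the path when it handles an event. TinyBehaviorBus, onAny and forwardTo are hypothetical names for illustration, not the actual BehaviorBus API.

```javascript
// Sketch only: a tiny stand-in for the BehaviorBus discussed above.
// The class shape, `metadata.path`, `onAny` and `forwardTo` are assumptions.
class TinyBehaviorBus {
  constructor(name) {
    this.name = name;
    this.listeners = [];
  }

  // Register a listener that receives every event (the '*' case in the thread).
  onAny(listener) {
    this.listeners.push(listener);
  }

  // Stamp the event with this bus's name, then deliver it to all listeners.
  emit(event) {
    const previousPath = (event.metadata && event.metadata.path) || [];
    const stamped = {
      ...event,
      metadata: { ...event.metadata, path: [...previousPath, this.name] },
    };
    for (const listener of this.listeners) listener(stamped);
    return stamped;
  }

  // Forward events to another bus unless they have already passed through it.
  forwardTo(otherBus) {
    this.onAny((event) => {
      if (!event.metadata.path.includes(otherBus.name)) otherBus.emit(event);
    });
  }
}

// Wire the two buses in both directions; the path check breaks the cycle.
const windowBus = new TinyBehaviorBus('WindowBehaviorBus');
const cdpBus = new TinyBehaviorBus('SimpleCDPBehaviorBus');
windowBus.forwardTo(cdpBus);
cdpBus.forwardTo(windowBus);
```

An event emitted on either bus reaches the other one exactly once, because the receiving side sees the sender's name in the path and does not forward it back.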
Thanks for the info! I haven't tested the code but I understand the principle and it sounds good to me. This pattern would probably help to better organize the code in cdp-client.js.
Ok cool, don't do any big changes to your code just yet! I'm still discussing the design with webrecorder / not convinced it's good enough yet.
I'll keep you posted! Let me know if you have any ideas on other approaches or how to improve it.
What are your thoughts on https://w3c.github.io/webdriver-bidi/ ? It seems like CDP is slowly going away in favor of it, so I'm considering removing the playwright/puppeteer/cdp contexts from the spec in favor of forcing bidi to be the common spec for browser-layer commands. Unfortunately it's not as clean as your nice proxy model solution and there are a lot of common utilities that are missing (e.g. waitForSelector(...)), but it might be the only way to have a unified format across all browsers/tools?
Scripts would look something like this:
// Using raw WebSocket from browser or Node for BiDi connection
import WebSocket from 'ws';

// this would be built into the spec / utility library
class WebDriverBiDi {
  constructor(websocketUrl) {
    this.ws = new WebSocket(websocketUrl);
    this.messageId = 0;
    this.subscribers = new Map();
    // resolves once the socket is open, so send() never writes to a connecting socket
    this.ready = new Promise((resolve) => this.ws.once('open', resolve));
    this.ws.on('message', (data) => {
      const message = JSON.parse(data);
      if (message.id) {
        const subscriber = this.subscribers.get(message.id);
        if (subscriber) {
          subscriber(message);
          this.subscribers.delete(message.id);
        }
      }
    });
  }

  async send(method, params = {}) {
    await this.ready;
    const id = ++this.messageId;
    const message = { id, method, params };
    return new Promise((resolve) => {
      this.subscribers.set(id, resolve);
      this.ws.send(JSON.stringify(message));
    });
  }
}

async function example() {
  // Connect to Chrome's BiDi endpoint
  // Chrome should be started with: --enable-bidi-protocol
  const bidi = new WebDriverBiDi('ws://localhost:9222/session');

  // Create a new context (tab)
  const { result: { context: contextId } } = await bidi.send('browsingContext.create', {
    type: 'tab'
  });

  // the code below here is what would be implemented inside a behavior...

  // Set up network interception
  await bidi.send('network.addIntercept', {
    phases: ['beforeRequestSent'],
    urlPatterns: [{ type: 'pattern', hostname: 'example.com' }]
  });
  // NOTE: pseudocode. BiDi commands are JSON messages and cannot carry a callback, and the
  // spec has no network.onIntercept command. A real client would subscribe to the
  // network.beforeRequestSent event and answer with network.failRequest or network.continueRequest.
  await bidi.send('network.onIntercept', {
    callback: (params) => {
      if (params.phase === 'beforeRequestSent' && params.request.url.includes('example.com')) {
        return { action: 'block' };
      }
      return { action: 'continue' };
    }
  });

  // Navigate to a URL
  await bidi.send('browsingContext.navigate', {
    context: contextId,
    url: 'https://google.com'
  });

  // Wait for element to appear
  const script = `
    new Promise((resolve) => {
      const checkElement = () => {
        const element = document.querySelector('input[name="q"]');
        if (element) {
          resolve(true);
        } else {
          requestAnimationFrame(checkElement);
        }
      };
      checkElement();
    });
  `;
  await bidi.send('script.evaluate', {
    target: { context: contextId }, // script.evaluate takes a target object, not a bare context id
    expression: script,
    awaitPromise: true
  });
  console.log('Search input found!');
}

// Run the example
example().catch(console.error);
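For the interception step specifically, BiDi is event based: the client registers an intercept, subscribes to network.beforeRequestSent, and then answers each paused request. A minimal sketch of that flow, assuming a socket object with send(text) and an assignable onmessage callback (the browser WebSocket shape). BiDiSession and blockHost are hypothetical names, and this has not been run against a real browser.

```javascript
// Sketch only: `BiDiSession`, `blockHost` and the injected `socket` are hypothetical.
// `socket` must provide send(string) and an assignable onmessage({ data }) callback.
class BiDiSession {
  constructor(socket) {
    this.socket = socket;
    this.nextId = 0;
    this.pending = new Map();  // command id -> { resolve, reject }
    this.handlers = new Map(); // event method -> [handler]
    socket.onmessage = ({ data }) => this.handleMessage(JSON.parse(data));
  }

  // BiDi messages are tagged with type: 'success', 'error' or 'event'.
  handleMessage(message) {
    if (message.type === 'event') {
      for (const handler of this.handlers.get(message.method) || []) handler(message.params);
      return;
    }
    const waiter = this.pending.get(message.id);
    if (!waiter) return;
    this.pending.delete(message.id);
    if (message.type === 'error') {
      waiter.reject(new Error(`${message.error}: ${message.message}`));
    } else {
      waiter.resolve(message.result);
    }
  }

  send(method, params = {}) {
    const id = ++this.nextId;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.socket.send(JSON.stringify({ id, method, params }));
    });
  }

  on(method, handler) {
    if (!this.handlers.has(method)) this.handlers.set(method, []);
    this.handlers.get(method).push(handler);
  }
}

// Fail every paused request to `hostname`, and resume any other paused request.
async function blockHost(session, hostname) {
  session.on('network.beforeRequestSent', (params) => {
    if (!params.isBlocked) return; // the request was not paused by an intercept
    const isTarget = new URL(params.request.url).hostname === hostname;
    const method = isTarget ? 'network.failRequest' : 'network.continueRequest';
    // Errors are ignored here only to keep the sketch short.
    session.send(method, { request: params.request.request }).catch(() => {});
  });
  await session.send('session.subscribe', { events: ['network.beforeRequestSent'] });
  await session.send('network.addIntercept', {
    phases: ['beforeRequestSent'],
    urlPatterns: [{ type: 'pattern', hostname }],
  });
}
```

Injecting the socket keeps the class usable both in Node (with a small adapter around ws) and inside a page, which is the window.BIDI idea discussed in this thread.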
A few potential benefits:
- BehaviorBus event forwarding and support for multiple contexts could be removed entirely in favor of just providing a bidi websocket directly in the page context (e.g. window.BIDI) and having all behaviors run in the context of window. If they need a CDP/bidi command they can just call await window.BIDI.send(...) from inside the page.

I think the WebDriver BiDi standard is a very good initiative. I'd had a look but hadn't noticed the existence of the script.addPreloadScript command; that's the point that blocked me in the past with WebDriver. I'll have to do some testing but I'm interested in replacing the CDP client with a BiDi client. My basic need was to be able to provide executables that weren't too heavy. That's why I went down this road.
In the short term, I think I'll try to implement a library based on the Proxy API.
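For readers unfamiliar with the approach, a minimal sketch of a Proxy-based client, assuming some transport function that sends a method name and params and returns a promise. createCdpProxy and sendCommand are hypothetical names, and this is not the code from cdp-client.js.

```javascript
// Sketch only: `createCdpProxy` and the `sendCommand` callback are hypothetical.
// cdp.Network.enable(params) is translated into sendCommand('Network.enable', params).
function createCdpProxy(sendCommand) {
  return new Proxy({}, {
    get(_target, domain) {
      // Ignore symbol lookups (e.g. from console inspection).
      if (typeof domain !== 'string') return undefined;
      return new Proxy({}, {
        get(_domainTarget, command) {
          // Returning undefined for `then` keeps a domain object from looking like a promise.
          if (typeof command !== 'string' || command === 'then') return undefined;
          return (params = {}) => sendCommand(`${domain}.${command}`, params);
        },
      });
    },
  });
}
```

Because no domain or command names are hard-coded, the same few lines would work over any transport that accepts a method string, which is part of why the executable can stay small.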
Overall, I find the idea interesting! For my part, I think I could implement a CDPCrawlDriver class (using the Chrome DevTools Protocol under the hood) in single-file-cli. Now, let's imagine a userscript written by a user for ArchiveBox that depends on PuppeteerCrawlDriver. Assuming the APIs of the two CrawlDriver classes are identical, if they wanted to run it in single-file-cli, would they be responsible for replacing the "puppeteer" occurrences with "cdp" in the userscript?