Open aaa3334 opened 1 month ago
Hello, I'm also new to web scraping, but puppeteer-extra works fine for me. I only use it to grab the headers, nothing more, so I don't really mind; if it causes problems I might switch to Playwright. As for NODE_URL, it's just a Node server that captures the request headers using puppeteer, since that doesn't seem to work inside Next.js.
Thanks for your reply! Ah ok - what is the code you are using at the /api/tiktok-headers?
I tried using puppeteer-extra, but any time I import it I get this: ⨯ ./node_modules/yargs/build/index.cjs Module not found (https://nextjs.org/docs/messages/module-not-found), repeated several times.
Which was why I was looking at playwright instead (but haven't figured out how to get the headers properly yet)
I am also wondering about the cookie - some others seem to set it but you have left it blank many places - is that something I should also be filling in with mine? Or is it not required?
FYI, I managed with Playwright, but now the API is not accepting the credentials and it loops trying to fetch new ones.
Is there a login or anything else needed for this? I capture the headers and try to fetch some adverts, but get "too many requests" with a 40100 error code.
Is that what you are getting too?
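For context, this is roughly how I'm attaching the captured values to the ads request. A sketch only: I'm assuming the three values map one-to-one onto the headers the Creative Center frontend sends, which may be exactly what's wrong if the API expects more (the function name here is mine, not from the repo):

```typescript
// Assumption: the three captured credentials become the user-sign, web-id
// and timestamp request headers, with no extra signing needed.
export function buildAdsRequestHeaders(creds: {
  userSign: string;
  webId: string;
  timestamp: string;
}): Record<string, string> {
  return {
    "user-sign": creds.userSign,
    "web-id": creds.webId,
    timestamp: creds.timestamp,
  };
}
```

If the 40100 error is really rate limiting rather than bad credentials, this header shape might be fine and the problem is request volume instead.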
The code for Playwright is below in case you want to try it (Chromium didn't work, so I used Firefox instead):
export async function getCredentials() {
  const { firefox } = require("playwright");
  try {
    let userSign: string | null = null;
    let webId: string | null = null;
    let timestamp: string | null = null;

    const browser = await firefox.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();

    // Capture the signing headers from the top-ads list request
    page.on("request", async (request) => {
      const url = request.url();
      if (url.includes("/creative_radar_api/v1/top_ads/v2/list")) {
        const headers = request.headers();
        userSign = headers["user-sign"] || null;
        webId = headers["web-id"] || null;
        timestamp = headers["timestamp"] || null;
      }
    });

    // Fall back to pulling the timestamp out of the response body
    page.on("response", async (response) => {
      const url = response.url();
      if (url.includes("/creative_radar_api/v1/top_ads/v2/list")) {
        const responseBody = await response.text();
        const timestampMatch = responseBody.match(/"timestamp":"(\d+)"/);
        if (timestampMatch) {
          timestamp = timestampMatch[1];
        }
      }
    });

    await page.goto(
      "https://ads.tiktok.com/business/creativecenter/inspiration/topads/pad/en?period=30&region=TR&secondIndustry=25300000000%2C25304000000",
      { waitUntil: "networkidle" }
    );

    // Poll for up to 10 seconds until all three values have been captured
    const maxTimeout = 10000; // 10 seconds
    const checkInterval = 500; // check every 500ms
    let elapsedTime = 0;
    while ((!userSign || !webId || !timestamp) && elapsedTime < maxTimeout) {
      await page.waitForTimeout(checkInterval);
      elapsedTime += checkInterval;
    }

    await browser.close();

    if (!userSign || !webId || !timestamp) {
      throw new Error(
        "Failed to capture all required headers within the timeout period"
      );
    }

    console.log("userSign", userSign);
    console.log("webId", webId);
    console.log("timestamp", timestamp);

    await saveCredstoSupabase({
      userSign: userSign!,
      webId: webId!,
      timestamp: timestamp!,
    });

    return { userSign, webId, timestamp };
  } catch (error) {
    console.log("There is an error getting credentials:", error);
  }
}

async function saveCredstoSupabase({
  userSign,
  webId,
  timestamp,
}: {
  userSign: string;
  webId: string;
  timestamp: string;
}) {
  const supabase = createClient(); // Supabase client factory, imported elsewhere in the file

  // Clear out the old credentials before inserting the fresh ones
  await supabase.from("tiktokheaders").delete().neq("id", 0);

  const { error } = await supabase
    .from("tiktokheaders")
    .insert([{ userSign, timestamp, webId }]);
  if (error) {
    console.log("Error saving credentials to Supabase:", error);
  }
}
Not really sure about the cookie. Yes, I was logged in and I did set my account cookies, but I think it should work fine without auth. (I can try your code later tonight or tomorrow and let you know.)
(Also, I haven't run the code for a week, so it might be broken if TikTok changed something.)
Add me on Discord so we don't clutter this thread: smiloxham
Hi! Just wondering if this is meant to be public or private? Looks like a lot of cool stuff in here! If it is meant to be public, I have two questions. First, regarding the scraping: is puppeteer-extra still maintained? I have seen people say Playwright is the better option now, but I am new to web scraping so I am not really sure. Second, what is process.env.NODE_URL meant to be?
Thanks in advance!