Open nynewco opened 7 months ago
ft.com/*usa*
I tried two versions based on your recommendation, but neither worked. Did I get something wrong?
url: "https://www.ft.com", match: "ft.com/usa",
url: "https://www.ft.com/", match: "ft.com/usa",
You need to use glob pattern matching (wildcards such as **) inside the match value, and the pattern has to cover the full URL, including https://.
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://api.python.langchain.com/en/stable/langchain_api_reference.html",
  match: "https://api.python.langchain.com/en/stable/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
So in your case, it would be:
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.ft.com/",
  match: "https://www.ft.com/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
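To illustrate why the two original attempts matched nothing, here is a minimal sketch of glob-style matching. The helper names `globToRegExp` and `matches` are hypothetical and written only for this example; this is a simplified stand-in, not the crawler's actual matcher, and real glob libraries handle many more cases. It assumes the pattern must match the whole link URL, scheme included.

```typescript
// Simplified glob-to-RegExp converter, for illustration only.
// Assumption: this approximates common glob rules; it is NOT the crawler's code.
function globToRegExp(glob: string): RegExp {
  let source = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        source += ".*"; // "**" spans any characters, including "/"
        i++; // skip the second "*"
      } else {
        source += "[^/]*"; // a single "*" stays within one path segment
      }
    } else if (ch === "?") {
      source += "[^/]"; // "?" stands for exactly one character
    } else {
      // Escape characters that are special in a regular expression.
      source += ch.replace(/[.+^${}()|[\]\\\/]/g, "\\$&");
    }
  }
  // Anchor both ends so the whole URL has to match, not just a part of it.
  return new RegExp("^" + source + "$");
}

function matches(glob: string, url: string): boolean {
  return globToRegExp(glob).test(url);
}

const link = "https://www.ft.com/usa/xyz";
const results = {
  bare: matches("ft.com/usa", link), // no scheme, no wildcard
  full: matches("https://www.ft.com/**", link), // full URL plus wildcard
};
console.log(results);
```

Under these assumptions, `bare` is false because the pattern lacks the `https://www.` prefix and has no wildcard for the rest of the path, while `full` is true.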
This worked great!
How do I get it to crawl the other 13 or 14 pages (see the pagination at the bottom of the page) on a website such as this one?
https://midlibrary.io/categories/illustrators
I noticed that pages 2, 3, and so on look like this:
https://midlibrary.io/categories/illustrators?226b50f6_page=2
https://midlibrary.io/categories/illustrators?226b50f6_page=3
https://midlibrary.io/categories/illustrators?226b50f6_page=4
Listing every page by hand is inefficient because it makes no use of wildcards.
Taken literally, the answer to what you asked would be:
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://midlibrary.io/categories/illustrators",
  match: [
    "https://midlibrary.io/categories/illustrators?226b50f6_page=2",
    "https://midlibrary.io/categories/illustrators?226b50f6_page=3",
    "https://midlibrary.io/categories/illustrators?226b50f6_page=4",
  ],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
But because the match URLs all contain 226b50f6_page, you should just be using wildcards:
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://midlibrary.io/categories/illustrators",
  match: "https://midlibrary.io/categories/illustrators?226b50f6_page=**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
You need to learn pattern matching.
How do I match all URLs that contain a certain word?
E.g. the word usa anywhere within any URL on the ft.com site:
ft.com/usa/xyz
ft.com/today/opinion/usa
ft.com/today/articles/usa
Is this what the selector is for? If so, how do you do it?
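One possible approach, offered as an untested sketch rather than a confirmed answer: keep using match, not the selector. As far as the project's README describes it, the selector picks which part of each page's content to extract, not which URLs to follow. The config below assumes that match accepts an array (as in the earlier reply) and that the matcher follows common glob rules, where `*` stays within one path segment and `**/` spans any number of segments. Both patterns are assumptions built on those rules.

```typescript
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.ft.com/",
  match: [
    // Assumption: "usa" appears in the last path segment, at any depth,
    // e.g. https://www.ft.com/today/opinion/usa
    "https://www.ft.com/**/*usa*",
    // Assumption: "usa" appears in an earlier segment, followed by more path,
    // e.g. https://www.ft.com/usa/xyz
    "https://www.ft.com/**/*usa*/**",
  ],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
```

Keep in mind that the crawler can only follow links it finds on pages it has already crawled, so a matching URL is reached only if the start page or another matched page links to it.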