BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License

How to Choose a Suitable CSS Selector for a Website #53

Open kongjining opened 7 months ago

kongjining commented 7 months ago

Inspecting Web Page Structure:

Open the target website (e.g., https://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp/). Right-click on the page element you wish to crawl (such as a specific piece of text or an area) and select "Inspect" to open the browser's developer tools.

Analyzing the Element:

In the developer tools, examine the HTML code of the element and look for attributes that uniquely identify the element or its container, such as class, id, or other attributes.

Building a CSS Selector:

Create a CSS selector based on the attributes you observed. For example, if an element has class="content", the selector could be .content; if the element has multiple classes, you can combine them, e.g. .class1.class2. A small illustration follows.
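For instance, a quick way to see how such selectors behave is to paste something like this into the browser console (the markup and the selectors here are made up purely for illustration):

```ts
// Illustrative only: a made-up HTML fragment and the selectors that match it.
// DOMParser is built into the browser, so this runs as-is in the console.
const sample = `
  <div id="main">
    <article class="post featured">Hello</article>
  </div>`;

const doc = new DOMParser().parseFromString(sample, "text/html");

console.log(doc.querySelector("#main"));          // matches by id
console.log(doc.querySelector(".post"));           // matches by a single class
console.log(doc.querySelector(".post.featured"));  // matches only when both classes are present
```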

In the "Console" tab of the developer tools, use document.querySelector('YOUR_SELECTOR') to test if the selector accurately selects the target element. Applying the Selector:

Applying the Selector:

Once a suitable selector is found, set it in the selector field of your crawler configuration (see the config sketch below). Make sure the chosen CSS selector accurately reflects the content you wish to extract from the webpage; an incorrect selector can leave the crawler unable to retrieve the desired data.
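To make the last step concrete: in gpt-crawler the selector goes into the config.ts at the project root. The sketch below follows the field names shown in the project README (url, match, selector, maxPagesToCrawl, outputFileName); the import path and the example values are assumptions, so adjust them to your checkout and your target site:

```ts
// config.ts (project root) -- a sketch, not a drop-in file.
// Field names follow the README; the values below are placeholders.
import { Config } from "./src/config";

export const defaultConfig: Config = {
  // Page the crawl starts from.
  url: "https://www.builder.io/c/docs/developers",
  // Pattern for which discovered links the crawler is allowed to follow.
  match: "https://www.builder.io/c/docs/**",
  // The CSS selector you validated in the console; content inside the
  // matching element is what gets captured for each page.
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
```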

bigshirtjonny commented 7 months ago

Something I've seen is that if the selector doesn't exist on one page (or the first page) of the crawl, the crawl ends with an error. How can we configure the crawl so that if a selector doesn't exist on one page, the crawler continues on to the next page?