FlowiseAI / Flowise

Drag & drop UI to build your customized LLM flow
https://flowiseai.com
Apache License 2.0
31.83k stars 16.59k forks

[FEATURE] Web scrapers - ignore / remove some elements or add webpage content transformer #2327

Open bendadaniel opened 6 months ago

bendadaniel commented 6 months ago

Hello, I have a Flowise workflow that scrapes our entire website (150+ pages) and saves the content to Pinecone. We currently use the Cheerio Web Scraper node (it could equally be Puppeteer or Playwright; it doesn't matter). We set 'Selector (CSS)' to 'main' so that the header and footer are ignored and only the relevant content is scraped.

Problem: When I look at the data in Pinecone, I can see a lot of invalid/unwanted text.

I have tried: I extended 'Selector (CSS)' to something like "main > :not(.headbar__video), main > :not(.headbar__video) *", but this doesn't work.

Question: Is there any way to exclude certain HTML elements from the scrape, or to transform the page content before saving it? As far as I can tell, Flowise has no way to do this at the moment.
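A note on why a selector alone may not be enough: reading the text of a selected element also returns the text of everything nested inside it, so a `:not(...)` applied to the selected element does not strip an unwanted descendant. The sketch below illustrates pruning by class during text extraction on a tiny hand-built tree. The `El` type, the `textWithout` function and the sample markup are hypothetical and are not Flowise or Cheerio APIs; with Cheerio, the comparable step would be removing the unwanted nodes (for example with `.remove()`) before reading the text.

```typescript
// Minimal stand-in for a parsed HTML element (hypothetical type for illustration only).
interface El {
  tag: string;
  classes: string[];
  text?: string; // text directly inside this element
  children?: El[];
}

// Collect text from a subtree, skipping every element that carries an excluded class.
function textWithout(node: El, excluded: Set<string>): string {
  if (node.classes.some((c) => excluded.has(c))) {
    return ""; // prune the whole subtree
  }
  const own = node.text ?? "";
  const nested = (node.children ?? []).map((child) => textWithout(child, excluded));
  return [own, ...nested].filter((s) => s.length > 0).join(" ");
}

// Sample page: the unwanted block is nested inside a child of <main>.
const main: El = {
  tag: "main",
  classes: [],
  children: [
    {
      tag: "section",
      classes: ["headbar"],
      children: [
        { tag: "div", classes: ["headbar__video"], text: "Play video 0:00" },
        { tag: "h1", classes: [], text: "Our products" },
      ],
    },
    { tag: "p", classes: [], text: "Product description." },
  ],
};

const cleaned = textWithout(main, new Set(["headbar__video"]));
console.log(cleaned); // "Our products Product description."
```

Pruning while walking the tree keeps the rest of `main` intact, which is harder to express with a single include-style selector.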

Idea: Puppeteer and Playwright both have an "evaluate" function in their code, where we could theoretically transform the scraped page content (see the evaluate function in the Puppeteer code on GitHub). Flowise could add an optional input to the Puppeteer node that accepts, for example, a 'Custom JS function' node. The evaluate function would pass the data to this node, which would transform it however the user wants and return the updated data. Or the solution could be something completely different; this is just my idea.
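To make the proposal concrete, here is a minimal sketch of what such an optional transform hook could look like. Everything in it is hypothetical: `ScrapedPage`, `PageTransform`, `dropLinesMatching` and `scrapeWithTransform` are illustrative names and do not exist in Flowise, Puppeteer or Playwright. The loader is represented by a plain function that applies the user's transform to each page before the text is stored.

```typescript
// Hypothetical shape of one scraped page.
interface ScrapedPage {
  url: string;
  text: string;
}

// Hypothetical hook: receives a scraped page and returns the cleaned page.
type PageTransform = (page: ScrapedPage) => ScrapedPage;

// Example transform: trim lines, drop empty lines and lines matching boilerplate patterns.
function dropLinesMatching(patterns: RegExp[]): PageTransform {
  return (page) => ({
    url: page.url,
    text: page.text
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0 && !patterns.some((p) => p.test(line)))
      .join("\n"),
  });
}

// Stand-in for the scraper node: applies the optional transform to every page.
function scrapeWithTransform(pages: ScrapedPage[], transform?: PageTransform): ScrapedPage[] {
  return transform ? pages.map((page) => transform(page)) : pages;
}

const raw: ScrapedPage[] = [
  {
    url: "https://example.com/a",
    text: "Accept cookies\n\nOur product helps teams.\n  Share on Facebook  \nContact us today.",
  },
];

const clean = scrapeWithTransform(raw, dropLinesMatching([/^accept cookies$/i, /^share on /i]));
const cleanedText = clean.map((page) => page.text).join("\n");
console.log(cleanedText);
```

Keeping the transform optional means existing flows would behave exactly as before when no function is connected.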

I think this would be a good feature, because it is important to have a clean retrieval dataset without garbage data.

What do you think? Thanks, Daniel

For example:

[Image: Screenshot 2024-05-05 at 22 00 01]

toi500 commented 6 months ago

I have noticed the same with the Cheerio node today. It did not work as expected.

For your website of 150 pages I would recommend using Apify, specifically the actor called Website Content Crawler. It is free to set up and you get $5 every month for scraping. This scraper provides really clean text from websites, and the dataset can be filtered and exported as JSON, XML, CSV, etc.

Here is an example dataset_website-content-crawler_2024-05-06_01-56-33-249.json
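
As a rough illustration of filtering such an export before loading it elsewhere, the sketch below parses a JSON dataset and keeps only items with enough text. The field names `url` and `text` are assumptions about the exported items and should be checked against the actual file; `CrawlItem`, `selectUsableItems` and the sample data are hypothetical.

```typescript
// Assumed item shape: `url` and `text` are assumptions about the exported dataset.
interface CrawlItem {
  url: string;
  text?: string;
}

interface UsableItem {
  url: string;
  text: string;
}

// Parse the exported JSON and keep only items whose trimmed text is long enough.
function selectUsableItems(json: string, minLength: number): UsableItem[] {
  const items = JSON.parse(json) as CrawlItem[];
  const usable: UsableItem[] = [];
  for (const item of items) {
    const text = (item.text ?? "").trim();
    if (text.length >= minLength) {
      usable.push({ url: item.url, text });
    }
  }
  return usable;
}

// Made-up sample standing in for an exported dataset file.
const sampleExport = JSON.stringify([
  { url: "https://example.com/", text: "  Welcome to our site with plenty of content.  " },
  { url: "https://example.com/empty", text: " " },
  { url: "https://example.com/no-text" },
]);

const usableItems = selectUsableItems(sampleExport, 10);
console.log(usableItems);
```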

HenryHengZJ commented 6 months ago

We have added a new Custom JS Loader, so users can perform custom operations on their data. You can also get better visibility into the chunks by using the new Document Store feature.
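
A minimal sketch of the kind of custom code such a loader might run, assuming it is expected to return an array of objects with `pageContent` and `metadata` (the LangChain document shape). That return shape is an assumption to verify against the node's own description; `LoaderDocument`, `toDocuments` and the sample pages are hypothetical.

```typescript
// LangChain-style document shape (assumed to be what the loader expects back).
interface LoaderDocument {
  pageContent: string;
  metadata: { source: string };
}

// Turn already-cleaned pages into documents, skipping pages with no text left.
function toDocuments(pages: { url: string; text: string }[]): LoaderDocument[] {
  return pages
    .filter((page) => page.text.trim().length > 0)
    .map((page) => ({
      pageContent: page.text.trim(),
      metadata: { source: page.url },
    }));
}

const documents = toDocuments([
  { url: "https://example.com/a", text: " Our product helps teams. " },
  { url: "https://example.com/b", text: "   " },
]);
console.log(documents);
```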

bendadaniel commented 6 months ago

> We have added a new Custom JS Loader, so users can perform custom operation on their data. You can also have better visibility of the chunks by doing it on the new document store feature

You mean 'Custom Document Loader' node right?

Giusti10 commented 3 months ago

> > We have added a new Custom JS Loader, so users can perform custom operation on their data. You can also have better visibility of the chunks by doing it on the new document store feature
>
> You mean 'Custom Document Loader' node right?

But today I can't use any of the Text Splitters for Custom Document Loader 😟
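
Until a Text Splitter can be attached, one possible workaround is to split the text inside the custom code itself. The sketch below is a plain fixed-size character splitter with overlap; it is not one of the Flowise Text Splitter nodes, and it assumes the loader accepts an array of `pageContent`/`metadata` objects. `Doc` and `splitIntoChunks` are hypothetical names.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, string | number>;
}

// Split one document into fixed-size character chunks that overlap by `overlap` characters.
function splitIntoChunks(doc: Doc, chunkSize: number, overlap: number): Doc[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("chunkSize must be positive and overlap must be in [0, chunkSize)");
  }
  const chunks: Doc[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < doc.pageContent.length; start += step) {
    chunks.push({
      pageContent: doc.pageContent.slice(start, start + chunkSize),
      metadata: { ...doc.metadata, chunk: chunks.length },
    });
    // Stop once the current chunk reaches the end of the text.
    if (start + chunkSize >= doc.pageContent.length) {
      break;
    }
  }
  return chunks;
}

const chunked = splitIntoChunks(
  { pageContent: "abcdefghij", metadata: { source: "https://example.com/a" } },
  4,
  1,
);
console.log(chunked.map((c) => c.pageContent)); // ["abcd", "defg", "ghij"]
```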