Open bendadaniel opened 6 months ago
I have noticed the same with the Cheerio node today. It did not work as expected.
For your website of 150 pages, I would recommend using Apify and, there, the actor called Website Content Crawler. It is free to set up, and you get $5 every month for scraping. This scraper provides really clean text from websites, and the dataset can be filtered down and exported as JSON, XML, CSV, etc.
Here is an example dataset_website-content-crawler_2024-05-06_01-56-33-249.json
We have added a new Custom JS Loader, so users can perform custom operations on their data. You can also get better visibility of the chunks by doing it in the new document store feature.
You mean the 'Custom Document Loader' node, right?
But today I can't use any of the Text Splitters with the Custom Document Loader 😟
Hello, I have a Flowise workflow that scrapes our entire website (150+ pages) and then saves it to Pinecone. We are currently using the Cheerio Web Scraper node (it could be Puppeteer or Playwright, it doesn't matter). We use the setting 'Selector (CSS)': 'main' to ignore the header/footer of the page and scrape only valid data.
Problem: When I look at the data in Pinecone, I can see that there is a lot of invalid/unwanted text data.
For example:
I have tried: I tried to extend 'Selector (CSS)' to something like this: "main > :not(.headbar__video), main > :not(.headbar__video) *", but this doesn't work.
Question: Do you have any idea whether it is possible to somehow exclude some HTML elements from the web scrape, or to transform the result of the page before saving? I think there is no way to do this in Flowise now.
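To make the "exclude some HTML elements" part of the question concrete, here is a minimal sketch of one possible approach: instead of selecting everything that is not the unwanted element, skip the unwanted subtree while collecting the text of main. It runs on a toy tree so that it is self-contained. `ElementNode` and `textWithout` are hypothetical names, not Flowise or Cheerio API, and the page structure (a video block nested inside a header bar) is assumed purely for illustration.

```typescript
// Toy element tree: a stand-in for a parsed HTML page, not a real DOM API.
interface ElementNode {
  tag: string;
  classes: string[];
  children: Array<ElementNode | string>; // plain strings are text nodes
}

// Collect text depth-first, skipping any subtree whose root has an excluded class.
function textWithout(node: ElementNode, excluded: Set<string>): string {
  if (node.classes.some((c) => excluded.has(c))) return "";
  return node.children
    .map((child) => (typeof child === "string" ? child : textWithout(child, excluded)))
    .filter((part) => part.trim().length > 0)
    .join(" ");
}

// Assumed example page: the unwanted block is nested, not a direct child of main.
const mainElement: ElementNode = {
  tag: "main",
  classes: [],
  children: [
    {
      tag: "div",
      classes: ["headbar"],
      children: [
        { tag: "h1", classes: [], children: ["Pricing"] },
        { tag: "div", classes: ["headbar__video"], children: ["Play video"] },
      ],
    },
    { tag: "p", classes: [], children: ["Plans start at 10 EUR."] },
  ],
};

const cleanedText = textWithout(mainElement, new Set(["headbar__video"]));
console.log(cleanedText);
```

When Cheerio is used directly, the same effect is usually achieved by calling `.remove()` on the unwanted selection and then `.text()` on main. Whether the scraper node exposes a place to do that is exactly what this issue is asking about.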
Idea: I can see that Puppeteer and Playwright have an "evaluate" function in their code, where we could theoretically "transform" the result of the scraped page (GitHub: Puppeteer code, evaluate function). So maybe there could be an option in Flowise to add one optional input to the Puppeteer node, which would accept, for example, a 'Custom JS function node'. The evaluate function would pass the data to this 'Custom JS function node', the function would transform the data however the user wants, and then it would return the updated data. Or something completely different; this was just my idea.
I think this would be a good feature, because it is important to have a good retrieval dataset without garbage data.
What do you think about it? Thanks, Daniel
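A rough sketch of what such an optional hook could look like, to make the idea concrete. Everything here is hypothetical: `ScrapedPage`, `PageTransform`, `applyTransform`, and the noise patterns are illustrative names and values, not existing Flowise, Puppeteer, or Playwright API.

```typescript
// Hypothetical shape of one scraped page as it leaves the scraper node.
interface ScrapedPage {
  url: string;
  text: string;
}

// Hypothetical hook: a user-supplied function that receives a page and returns the cleaned page.
type PageTransform = (page: ScrapedPage) => ScrapedPage;

// The scraper would call this before handing pages to the splitter / vector store.
// With no transform connected, pages pass through unchanged.
function applyTransform(pages: ScrapedPage[], transform?: PageTransform): ScrapedPage[] {
  return transform ? pages.map((page) => transform(page)) : pages;
}

// Example user transform: trim lines, drop empty ones and known noise lines (assumed patterns).
const noisePatterns: RegExp[] = [/^play video$/i, /^accept cookies$/i];

const dropNoise: PageTransform = (page) => ({
  ...page,
  text: page.text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !noisePatterns.some((re) => re.test(line)))
    .join("\n"),
});

const transformedPages = applyTransform(
  [{ url: "https://example.com/pricing", text: "Pricing\n  Play video \n\nPlans start at 10 EUR." }],
  dropNoise,
);
console.log(transformedPages[0].text);
```

Keeping the hook optional means existing flows would behave as before, and only flows that connect a transform would get the extra cleaning step.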