BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License

Multiple concurrent crawlers with split output. Asking if there is interest in completing my fork. #82

Open maxime4000 opened 7 months ago

maxime4000 commented 7 months ago

@steve8708 Gauging interest: I have done a big refactoring of the codebase to integrate these features:

My needs:

I wanted to create a knowledge base for Godot, with each section separated into its own file. I managed to do it with multiple configs. But now that it's done and I have the output I needed, I am not interested in fixing the logging part. It was useful for spotting the occasional error, but not that helpful imo.
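To illustrate the "one config per section" idea, here is a minimal sketch. The `SectionConfig` shape, the `buildSectionConfigs` and `crawlAll` helpers, and the base URL are all hypothetical names made up for this example (the fields are loosely modelled on a typical crawler config), and `crawl` is a stand-in for whatever actually performs the crawl. This is not the code from the fork.

```typescript
// Hypothetical config shape: one entry per documentation section.
interface SectionConfig {
  name: string;            // section label, e.g. "tutorials"
  url: string;             // start URL for this section
  match: string;           // glob of URLs that belong to the section
  maxPagesToCrawl: number; // per-section page limit
  outputFileName: string;  // each section gets its own output file
}

// Build one config per section so the output is split by section.
function buildSectionConfigs(
  baseUrl: string,
  sections: string[],
  maxPages: number,
): SectionConfig[] {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return sections.map((name) => ({
    name,
    url: `${root}/${name}/index.html`,
    match: `${root}/${name}/**`,
    maxPagesToCrawl: maxPages,
    outputFileName: `${name}.json`,
  }));
}

// Stand-in for the real crawl: resolves to the number of pages written.
type CrawlFn = (config: SectionConfig) => Promise<number>;

// Run every section concurrently and report pages written per output file.
async function crawlAll(
  configs: SectionConfig[],
  crawl: CrawlFn,
): Promise<Map<string, number>> {
  const counts = await Promise.all(configs.map((c) => crawl(c)));
  return new Map(
    configs.map((c, i): [string, number] => [c.outputFileName, counts[i]]),
  );
}
```

With sections such as `about`, `getting_started`, `tutorials`, `community` and `contributing`, this would yield five configs and five separate output files.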

Current state

So the current changes are big and about 90% finished. Nonetheless, I think they are an improvement, just not a "fully stable" and complete one... Everything that was added is fully functional, but I still have issues with the terminal output. If the lines get wrapped, the output gets ugly. Nx has a similar issue with their run-many CLI, so I don't know if it's VS Code, the terminal, or the lib... I'm just not interested in completing the feature.

> @builder.io/gpt-crawler@0.0.1 build
> tsc

Crawling started.
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | getting_started | 10/33 (L: 50, F: 33) | ETA: 101s | /getting_started/step_by_step/instancing.html
███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | tutorials | 9/50 (L: 50, F: 327) | ETA: 268s | /tutorials/best_practices/godot_interfaces.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | contributing | 9/47 (L: 50, F: 47) | ETA: 248s | /contributing/development/index.html
INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6323,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":56909,"requestsTotal":9,"crawlerRuntimeMillis":60560,"retryHistogram":[9]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████░░░░░░░░ | getting_started | 26/33 (L: 50, F: 33) | ETA: 28s | /getting_started/first_3d_game/03.player_movement_code.html
█████████████████████░░░░░░░░░░░░░░░░░░░ | tutorials | 26/50 (L: 50, F: 327) | ETA: 91s | /tutorials/editor/managing_editor_features.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
█████████████████████░░░░░░░░░░░░░░░░░░░ | contributing | 26/50 (L: 50, F: 57) | ETA: 92s | /contributing/development/debugging/using_sanitizers.html
INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4464,"requestsFinishedPerMinute":13,"requestsFailedPerMinute":0,"requestTotalDurationMillis":116054,"requestsTotal":26,"crawlerRuntimeMillis":120568,"retryHistogram":[26]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
██████████████████████████████████████░░ | tutorials | 47/50 (L: 50, F: 327) | ETA: 8s | /tutorials/3d/procedural_geometry/arraymesh.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
███████████████████████████████████░░░░░ | contributing | 44/50 (L: 50, F: 73) | ETA: 19s | /contributing/documentation/class_reference_primer.html
INFO Sta
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
████████████████████████████████████████ | tutorials | 50/50 (L: 50, F: 327) | ETA: 0s | Completed
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | contributing | 50/50 (L: 50, F: 73) | ETA: 0s | Completed

I made this multi progress bar because, with concurrent crawling, the log was hard to follow. It's easier to follow this way, but when other logging (errors, info, and so on) happens in the meantime, it's a mess...
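For readers who want to see how one line of such a display could be produced, here is a small sketch. It is reconstructed from the sample output only, not taken from the fork: reading `L` as the page limit and `F` as the number of pages found, and using `min(L, F)` as the total, are assumptions inferred from the numbers shown, and `SectionProgress`, `renderBar` and `formatLine` are hypothetical names.

```typescript
// Hypothetical per-section state; field meanings are assumptions.
interface SectionProgress {
  name: string;       // section label
  done: number;       // pages crawled so far
  limit: number;      // assumed meaning of "L": max pages to crawl
  found: number;      // assumed meaning of "F": pages discovered so far
  etaSeconds: number; // estimated seconds remaining
  current: string;    // path of the page currently being crawled
}

// Render a fixed-width bar with filled and empty block characters.
function renderBar(done: number, total: number, width: number = 40): string {
  const ratio = total > 0 ? Math.min(1, done / total) : 0;
  const filled = Math.round(ratio * width);
  return "█".repeat(filled) + "░".repeat(width - filled);
}

// Format one status line in the same layout as the sample output.
function formatLine(p: SectionProgress): string {
  const total = Math.min(p.limit, p.found); // assumption: cap found pages at the limit
  const status = p.done >= total ? "Completed" : p.current;
  return [
    renderBar(p.done, total),
    p.name,
    `${p.done}/${total} (L: ${p.limit}, F: ${p.found})`,
    `ETA: ${p.etaSeconds}s`,
    status,
  ].join(" | ");
}
```

Each crawler would own one `SectionProgress`, and the display would redraw all lines whenever any of them changes.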

The issue:

When this "type" of line is printed by PlaywrightCrawler, it breaks the multi progress bar:

INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4525,"requestsFinishedPerMinute":12,"requestsFailedPerMinute":0,"requestTotalDurationMillis":113118,"requestsTotal":25,"crawlerRuntimeMillis":120511,"retryHistogram":[25]}

The multi progress bar display gets corrupted. I don't understand terminals and Playwright well enough to know exactly what to change to fix this.
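One possible direction, sketched under assumptions rather than tested against the fork: erase the bar region before any log line is printed, print the log line, then redraw the bars, while counting how many terminal rows each bar line really occupies once wrapped. The sketch assumes that every log line can be routed through a single writer (how to do that for the crawler's own logger is not shown here), that each character is one column wide, and that the bars have not scrolled off screen. `BarRegion` and `rowsFor` are hypothetical names; the only terminal features used are the standard ANSI sequences "cursor up n" (`ESC[nA`) and "erase to end of screen" (`ESC[0J`).

```typescript
// Rows a line occupies after the terminal wraps it at `columns`.
// Assumes every code point is exactly one column wide.
function rowsFor(line: string, columns: number): number {
  const width = Array.from(line).length;
  return Math.max(1, Math.ceil(width / columns));
}

class BarRegion {
  private drawnRows = 0; // rows currently occupied by the bars
  private write: (chunk: string) => void;
  private columns: number;

  constructor(write: (chunk: string) => void, columns: number) {
    this.write = write;
    this.columns = columns;
  }

  // Escape sequence that moves up over the drawn bars and clears them.
  private erase(): string {
    return this.drawnRows > 0 ? `\x1b[${this.drawnRows}A\x1b[0J` : "";
  }

  // Replace the previously drawn bars with `lines`.
  draw(lines: string[]): void {
    let out = this.erase();
    this.drawnRows = 0;
    for (const line of lines) {
      out += line + "\n";
      this.drawnRows += rowsFor(line, this.columns); // count wrapped rows too
    }
    this.write(out);
  }

  // Print a log message above the bars, then redraw the bars below it.
  log(message: string, lines: string[]): void {
    this.write(this.erase() + message + "\n");
    this.drawnRows = 0;
    this.draw(lines);
  }
}
```

In a Node.js program, `write` could be `(chunk) => process.stdout.write(chunk)` and `columns` could come from `process.stdout.columns`. Counting wrapped rows is what keeps the cursor movement correct when a long line, such as a bar followed by a deep URL path, exceeds the terminal width.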

Why I'm asking

I have no interest in fixing the terminal output since I got what I wanted, but the changes as a whole are an improvement, so I'm asking whether I could open a PR and let someone else fix the issue in that PR and push it. I guess the concurrent part could be omitted, and that would "make the PR complete".

Other changes that I can omit if not wanted.

I use a "modern" Prettier config, and my editor formats with my own config when the repo I'm working on doesn't have one. I set up Prettier because my editor was already changing the formatting whenever I saved, but I'm OK with reverting this. I could also push it if wanted. I copied some files that would configure that, as I wasn't planning to make big changes, but I'm willing to remove those too if there's no interest.

Here are some visual previews:

image image