Major Refactoring for Enhanced Flexibility and Performance

maxime4000 commented 7 months ago

🚨 Warning

This is fully functionnal and shouldn't break anythings that weren't breaking, but it is not fully "finished/stable". I would like to have someone finishing the PR. Open to revert some possibly "controversial" changes, but not interested to fix the multi progressbar as I don't have deep knowledge into playwright and the logger.

Overview

This PR introduces a major refactoring of the codebase, adding support for multiple configurations and enhancing the overall performance and flexibility of the crawler.

Was talked briefly in #82

Key Changes

Multiple Configurations: Now supports handling multiple configurations, allowing more versatile use cases.
New Flexible Configuration Options:
- Exclude Selectors: Allows exclusion of content from selectors for targeted data extraction.
- Max Concurrency: Enables concurrent processing to optimize performance.
- Config Naming: Useful for organizing multiple configurations.
Improved Configuration Parsing: Configs now have default values set within the parser.
Output Directory Structure: Changed to output/, improving data management.
Multi Progress Bar Interface: Replaced individual page logging with a multi progress bar for clearer progress monitoring.
Cleaner Output: Streamlined the output for clarity.
Codebase Refactoring: Comprehensive overhaul to improve maintainability and efficiency.
Sub Routing Namings: Enhanced routing naming conventions.
File Path Changes: Updated output file path to output/data.json.
.gitignore Fix: Improved source control management.
Prettier Configuration: This update introduces a unique blend of traditional and modern coding styles, focusing on enhancing readability, clarity, and maintainability. A significant aspect of this configuration is the use of tab characters for indentation, prioritizing accessibility and offering better flexibility for visually impaired developers.
- Can also be reverted by deleting the config file and run npm run prettier or npm run fmt
- We should delete one of the 2.
- IMO: fmt

Partially Completed Features and Known Issues

Full CLI Support: Haven't use the CLI, should be tested.
Terminal Log Issues: INFO and ERROR logs not displayed correctly due to the multi-progress bar feature. They are showing up, but overwritten by the progressbar.
Progress Bar Wrapping: Issues with wrapping in narrow terminals.

Current State and Request for Collaboration

The changes made significantly improve the tool's capabilities. However, they are not entirely stable, particularly regarding terminal output. I seek collaboration from the community to resolve these issues, especially the log display discrepancies and progress bar wrapping.

Note :

I am open to make those changes as they might be controversial:

Revert Prettier config
Disable/Remove Multi Progressbar.
- Tho, I would prefer if someone would be willing to fix it. Maybe someone that has better understanding of terminal output that I do.
- My goal was to make understanding the current progress easier, but it introduce bugs with logging coming from playwright... Wasn't expecting this when I start the feature...

maxime4000 commented 7 months ago

@marcelovicentegc Thanks I appreciate :)

Prettier Tab Width: The 'tab width' setting in Prettier is not mandatory when using tabs. In such cases, the IDE's setting (if specified) determines the tab size, otherwise it defaults to 4. This means this settings is not important and could be removed. However, for languages like YAML or Python, where tabs are not allowed, Prettier would uses this value to insert 4 spaces. That being said, we could omit this setting and it would be using "the Prettier opiniated" default value. (Said that prettier support that file format in the first place)
Multi Progress Bar Changes: I can cherry pick these changes into a new Draft PR and assign it to you in the following days. However, I won't assist more than giving you hint when releasing the draft PR. (Might answering initial question if you have some)

Caveat of Reverting the Logger

When running multiple crawlers simultaneously, log outputs will mix, making real-time tracking challenging. This issue mainly arises with multiple configurations. But I'm also guessing that if you use this tools with multiple config, you kinda know what you are doing. 😉

Timeline for Changes

I plan to complete thoses changes by tomorrow or the next day. However, I'd prefer to have others review these changes before proceeding further.

The proposed changes are substantial, and I've made some important decisions independently. It's prudent to wait for feedback, especially from someone at BuilderIO, to ensure there's no disagreement or need for discussion about these changes before doing changes.

subframe7536 commented 7 months ago

Maybe the executor should also be changed. tsc is a little slow. How about using esbuild, like https://github.com/esbuild-kit/esno

maxime4000 commented 7 months ago

@subframe7536 While you do have a point about the speed of tsc and esbuild/swc would be a better transpiller, we could also says that when we run npm run start, there is no need to build and could be using tsx or ts-node instead.

While your point is an improvement, this PR is already doing too much and thus, I won't make that change here. Though, I suggest you to make a separate PR with this change as it would be faster if you are interested to bring that to the project.

maxime4000 commented 6 months ago

I'm sorry @marcelovicentegc, I couldn't make it...

I was optimistic about my own time and deadlines to do the things I promise to do. And I blinked, and it's been a month and couldn't put the time to finish it...

I think about it, I'm anxious about the promise of doing it but can't find the time to do it. Also, the facts are that my customs GPTs using the output produced by this PR wasn't any better than without it. In facts, chatgpt was slower, more prone to "network error" and didn't give better results than without "the knowledge".

Open AI did update their model to use more recent data and it made it feels like this PR was procrastination with my borrowed times as I should be using this time to build the things I want to build instead of improving a "random" tools that will scrape data to build knowledge files that will slightly help an AI to be more "accurate" when the AI is already able to help me without this data and also the output generated from the tools and the improvement I made didn't end up used because the experience was worse when those files were added to the custom GPT...

Motivation has been low as the beneficial gains of using the tools weren't substantial. I came here with a "take it or leave it" mentality, I promise to do better and support my work, but the urgency of finishing the work fall off short as I didn't used that work in the end...

I'm still planning to finish this PR, but I wanted to be transparent about this. I'm on borrowed times with my current life situation, I want to use that time to get the most of it and hopefully secure an independent prosper future. This was procrastination to my current goals. I wish it would have been more useful to me. I will take time to finish it someday, but it's not my priority at the moment.

Reviewing changes is not the same mental load as doing the changes and I'm still open about reviewing contributions to merge those improvements the sooner, but can't give an estimate of when will I do the promised changes.

On that note, I wish everyone who read this a good holiday season, an happy new years and to take time to take care of yourself 😊

BuilderIO / gpt-crawler