Open cpdata opened 7 months ago
I applied Prettier formatting to the files that failed: README.md, src/config.ts, src/core.ts, and config.ts. I also added the jsdoc/typedoc formatting recommended by @marcelovicentegc in response to my original pull request #102. Additionally, I added a .prettierignore file.
@marcelovicentegc this look good to you to merge?
Hey @steve8708! Happy New Year! One rebase and a few nitpicks ☝️ and it seems to me that we are good to go 🤗
Please merge this branch ASAP!
Initial Improvements
Main Additions
maxPagesToCrawl: if set to 0, the crawler will crawl all matching URLs and show progress as 1/∞.
maxConcurrency: sets the number of concurrent crawl requests. If left undefined, the crawler opens as many parallel connections as it can, like the original's default. It now defaults to 1 to avoid getting IP banned.
waitPerPageCrawlTimeoutRange: defaults to a range of 1 second to 1 second, but can be set to any two numbers in milliseconds to create a random delay between requests and avoid rate-limit rejection when crawling.
headless: true by default, but can now be configured in the config.ts file for situations that require it.
Full Summary
Added *.code-workspace to .gitignore for VS Code workspaces saved in the root of the project. (Add VSCode workspace file in .gitignore)
Final output .json files go to the outputs/ folder so they are not overwritten. (Add outputs dir to .gitignore for final outputs)
Dynamic domain + date-timestamp final output file name, e.g. outputs/domain.com-2023-11-28-12:02:51.json. (Add Dynamic OutputFileName based on date-timestamp)
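The domain-plus-timestamp naming described above could be sketched roughly as follows. This is only an illustration: `buildOutputFileName` and `pad2` are hypothetical names, and the PR's actual implementation is not shown here, so the details (local time, the exact separators) are assumptions based on the example file name.

```typescript
// Hypothetical helper: zero-pads a number to two digits (e.g. 2 -> "02").
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Hypothetical sketch: builds "outputs/<domain>-<YYYY-MM-DD>-<HH:MM:SS>.json"
// from the crawl's start URL and a timestamp (local time is an assumption).
function buildOutputFileName(startUrl: string, now: Date = new Date()): string {
  const domain = new URL(startUrl).hostname;
  const date =
    now.getFullYear() + "-" + pad2(now.getMonth() + 1) + "-" + pad2(now.getDate());
  const time =
    pad2(now.getHours()) + ":" + pad2(now.getMinutes()) + ":" + pad2(now.getSeconds());
  return "outputs/" + domain + "-" + date + "-" + time + ".json";
}
```

Because each run gets a distinct timestamp, successive crawls of the same domain produce separate files rather than overwriting one another. Note that `:` is not allowed in file names on Windows, so a variant that swaps it for `-` may be preferable there.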
maxPagesToCrawl: if set to 0, crawling continues for all matching URLs and progress displays the infinity symbol, e.g. 1/∞, 2/∞, 3/∞ (default = 50). (Allow maxPagesToCrawl to be optional and infinite by setting 0 which will display the infinity symbol)
maxConcurrency: some sites automatically block connections to prevent DDoS attacks. This config sets how many concurrent requests run at a time (default = 1). (Added maxConcurrency config to set maximum concurrent parallel requests.) (Updates to core.ts to add config parameters for maxPagesToCrawl, maxConcurrency, maxRequestsPerCrawl, headless)
waitPerPageCrawlTimeoutRange: config added to set a random range in milliseconds between requests. Some sites automatically block connections, so this two-number object introduces a random delay between requests for rate-limit handling (default = 1000). (Update to core.ts for maxPagesToCrawl)
headless: now a config option (default = true). (Added headless mode as a configuration parameter)
Random rate-limiting range with the waitPerPageCrawlTimeoutRange config. (Added waitPerPageCrawlTimeoutRange for a random range in milliseconds between page requests to help with rate limiting)
One-line improvement to prevent a VS Code warning for a non-existent Docker container. (Added ts-ignore for docker config.ts to prevent VSCode from declaring missing file that isn't created until the Docker is.)
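To make the options above concrete, here is a minimal sketch of how they might fit together. The `CrawlOptions` interface, the `{ min, max }` shape of the range, and the helper names `progressLabel` and `randomDelayMs` are all assumptions for illustration; only the option names and default values come from the description.

```typescript
// Hypothetical shape for the four options; field types are assumptions.
interface CrawlOptions {
  maxPagesToCrawl: number; // 0 is treated as "no limit"
  maxConcurrency: number; // concurrent requests at a time
  waitPerPageCrawlTimeoutRange: { min: number; max: number }; // milliseconds
  headless: boolean;
}

// Defaults as stated in the description: 50 pages, 1 request at a time,
// a fixed 1000 ms delay (1 second to 1 second), headless browser.
const defaults: CrawlOptions = {
  maxPagesToCrawl: 50,
  maxConcurrency: 1,
  waitPerPageCrawlTimeoutRange: { min: 1000, max: 1000 },
  headless: true,
};

// Formats progress as "done/total", using the infinity symbol when the limit is 0.
function progressLabel(pagesDone: number, maxPagesToCrawl: number): string {
  const total = maxPagesToCrawl === 0 ? "∞" : String(maxPagesToCrawl);
  return pagesDone + "/" + total;
}

// Picks a random whole number of milliseconds within the range, inclusive.
// Swapped bounds are tolerated by sorting them first.
function randomDelayMs(range: { min: number; max: number }): number {
  const low = Math.min(range.min, range.max);
  const high = Math.max(range.min, range.max);
  return Math.floor(low + Math.random() * (high - low + 1));
}
```

With the defaults, `randomDelayMs` always yields 1000 because both bounds are equal; widening the range, for instance to `{ min: 500, max: 3000 }`, spreads requests out less predictably.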
Chunked data goes into the storage dir. Final compiled JSON file outputs go into the new outputs directory. (Added Output Directory for all outputFileName to go into so they aren't overwritten in storage)
Added more variables to the ./config.ts file for setting up the config in a more customized way that also includes the automatic naming convention domain-timestamp.json. (Additions to dynamic url and match configurations in config.ts)
Added details for waitForSelectorTimeout in the README.md file. (Added waitForSelectorTimeout to README.md)
Added additional Markdown and TypeScript formatting to the config.ts and README.md files. (Adding details to README.md and config.ts as well as extra formatting.)
13 commits, which hopefully makes review a little easier.
I would like to contribute to this project on a regular basis. I have a lot of experience with web scraping, AI/LLMs, CI/CD, and automation, and would like to discuss with the main collaborators where I can be of the most use.