-
Thank you for sharing. I would like to ask where your data comes from.
-
Some link titles and descriptions cannot be obtained correctly. This is caused by Cloudflare and other protection mechanisms.
Some entries were edited manually.
If a page changes drastically. For e…
-
The web crawlers have been merged onto the EC2 instance; however, the Shadow Seals crawler does not require an EC2. Therefore, it should be split from OFA and the EC2, then moved over to its own Lambda.
Pl…
-
Add crawl spiders for the following popular websites:
- Youtube
- Quora
- Facebook
- Reddit
- GitHub
Currently implemented spiders can be found at https://github.com/leopardslab/CrawlerX/…
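As a starting point for any new spider, the core step is discovering links on a fetched page. This is a stdlib-only sketch of that step, not tied to the CrawlerX codebase (the class name and sample HTML are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute hrefs from anchor tags; a minimal building
    block a new spider could reuse for link discovery."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/questions/1">Q1</a> <a href="https://example.com/about">About</a>'
parser = LinkExtractor("https://example.com")
parser.feed(html)
print(parser.links)
```

A real spider would feed fetched pages into the extractor and queue the resulting links, subject to robots.txt and per-site rate limits.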
-
Because articles are stored as MD, there is an obvious contextual disconnect between what is in the MD and what is actually rendered on the page. An article can contain a level-1 Markdown heading, but the pa…
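One way to narrow that disconnect is to normalize the Markdown source to plain text before comparing it with the rendered page. This is only a rough sketch handling headings, inline links, and emphasis; a full solution would use a real Markdown renderer:

```python
import re

def md_to_plain(md: str) -> str:
    """Very rough sketch: strip common Markdown markers so the source
    can be compared with the text a browser actually renders.
    Only headings, inline links, and emphasis are handled here."""
    text = re.sub(r"^#{1,6}\s*", "", md, flags=re.M)           # "# Title" -> "Title"
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)       # [text](url) -> text
    text = re.sub(r"[*_]{1,2}([^*_]+)[*_]{1,2}", r"\1", text)  # **bold** -> bold
    return text

print(md_to_plain("# My Title"))                       # -> My Title
print(md_to_plain("See [docs](https://example.com)"))  # -> See docs
```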
-
I realized that when you search for text from a Course Hero document on Google, the problem and answer appear in the page description. That means Course Hero has a text version of the PDF openly availab…
-
## 💡 Description
The current bot detection routine is fairly basic and rule-based. Create a more complete solution to detect web crawlers and bot interaction with PDS Nodes.
-
### What problem are you trying to solve?
Currently, there is no standard way for webpages to declare tasks that AI assistants can perform on their content. This leads to an inconsistent and fragment…
-
### Project Description
Develop a tool that reads in the web logs of an ERDDAP server to analyse how the server is being used. This would include:
- filtering out bots/crawlers/spam
- analysing a…
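The first step above can be sketched with the standard library, assuming the server writes Apache "combined"-style access logs (the log format, hint strings, and function name are assumptions for illustration):

```python
import re
from collections import Counter

# Assumed Apache "combined" log format; adjust for the actual ERDDAP logs.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
BOT_HINTS = ("bot", "crawler", "spider")

def human_requests(lines):
    """Yield parsed log entries, skipping obvious bots/crawlers."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and not any(h in m["agent"].lower() for h in BOT_HINTS):
            yield m.groupdict()

sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /erddap/tabledap/x.csv HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [01/Jan/2024:00:00:01 +0000] "GET /erddap/index.html HTTP/1.1" 200 100 "-" "Googlebot/2.1"',
]
paths = Counter(e["path"] for e in human_requests(sample))
print(paths)  # only the human request remains
```

The later analysis steps would then aggregate the surviving entries, e.g. counting requests per dataset path or per day.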
-
The default configuration uses ports 80 and 443 for both the container and the host machine.
Is it possible to run on a non-default port e.g. 1234 for HTTPS? We don't want Internet crawlers and malicious …
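If the deployment uses Docker Compose, only the host side of the mapping needs to change; the container can keep listening on its defaults. A sketch, with an assumed service name:

```yaml
# Sketch, assuming a docker-compose deployment; "app" is an illustrative name.
services:
  app:
    ports:
      - "1234:443"   # host port 1234 -> container's default HTTPS port 443
```

Note that a non-default port only reduces casual discovery; crawlers and scanners that probe all ports will still find the service, so authentication or firewall rules are still needed.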