-
It would be nice to add an optional topical-crawling strategy to Frontera. It could take a dictionary of words describing some topic as input and crawl from seed URLs, searching for document…
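
To make the idea concrete, here is a minimal sketch of scoring a page against a topic dictionary. It does not use Frontera's API; `topic_score`, `should_follow`, and the `threshold` value are hypothetical names and defaults chosen only for illustration.

```python
import re


def topic_score(text, topic_words):
    """Return the fraction of tokens in `text` found in the topic dictionary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    topic = {word.lower() for word in topic_words}
    hits = sum(1 for token in tokens if token in topic)
    return hits / len(tokens)


def should_follow(text, topic_words, threshold=0.05):
    """Hypothetical rule: keep crawling from pages that look on-topic."""
    return topic_score(text, topic_words) >= threshold
```

A real strategy would probably weight terms (for example with TF-IDF) rather than count raw matches, but the simple ratio shows where a topic dictionary would plug in.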
-
Hey @adbar,
I am currently testing the spider function, and I have encountered an issue when attempting to use a category URL to fetch posts specifically from that category.
Here is the code sni…
-
[AWS Glue](https://aws.amazon.com/glue/features/) seems really useful, especially its fuzzy FindMatches feature ([although LLM-based cosine similarity embeddings should provide similar features](http…
-
- [ ] Talk about the running-time complexity of the algorithm used.
- [x] Web characterization **[6]**
- [x] Methods for sampling, Web dynamics, Estimating freshness and age, Characterization of We…
-
### Pitch
In August, a GPTBot block was merged into the code: https://github.com/mastodon/mastodon/pull/26396. Now, Google has a robots.txt policy for Bard and future Google AI models with user agent Go…
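
For reference, here is a small Python sketch of how a per-agent robots.txt block is interpreted by a compliant crawler. It is not Mastodon's code: the `ROBOTS_TXT` content, the `is_allowed` helper, and the example URL are assumptions made for illustration, and only the GPTBot agent mentioned above is used.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block one named crawler from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed("GPTBot", "https://example.com/about")` is false, while an agent with no matching block is still allowed. Adding another AI crawler would mean adding another `User-agent` block of the same shape.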
-
I'm using docker-compose on Windows 10.
When I run "docker-compose up", everything works fine.
Elasticsearch works fine and the DDT tool works fine, but the crawler won't work, when I use deep crawling and us…
-
**Is your feature request related to a problem? Please describe.**
The ultimate aim of the project is to detect illicit websites. As of now, the algorithm uses graph knowledge to target suspicious lin…
-
```
import urllib.robotparser

from trafilatura.spider import focused_crawler

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow…
-
Whereas the code for Yserbius and Twinion can only run Yserbius-style games (and even more specifically, only Yserbius and Twinion), Cawdor runs on a new client/server dungeon crawler engine that is s…
-
I would love to create a gpt out of a github repo. Can you please add this?
K thx bai