-
I tried crawling a couple of sites with the Nutch crawler. It reports that it has crawled ~13000 pages. When I click the visualize button, the Kibana dashboard says I have to configure an index…
-
When ArchiveBot hits a .swf file, it should decompile it and search for URLs in the ActionScript. This may be tricky to implement, but it would fix most problems that come with archiving Flash-based s…
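
The URL-search half of this idea can be sketched on its own. The example below is only an illustration and assumes the .swf has already been decompiled to ActionScript text by some external tool; the function name `extract_urls`, the regular expression, and the sample source are hypothetical and are not part of ArchiveBot.

```python
import re

# Assumption: absolute http(s) URLs appear as string literals in the
# decompiled source. The pattern stops at whitespace, quotes, and brackets.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>()\\]+""")


def extract_urls(actionscript_source):
    """Return the unique absolute URLs in the text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(actionscript_source):
        # Trim punctuation that commonly trails a URL in code or comments.
        url = match.group(0).rstrip(".,;")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical decompiled ActionScript, used only to exercise the function.
sample = '''
var loader:URLLoader = new URLLoader();
loader.load(new URLRequest("http://example.com/data/config.xml"));
navigateToURL(new URLRequest('https://example.com/next.swf'), "_self");
loader.load(new URLRequest("http://example.com/data/config.xml"));
'''

print(extract_urls(sample))
```

A plain regex like this would miss URLs that the ActionScript assembles at runtime by string concatenation, which is probably part of why the feature is tricky.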
-
The current database design has two blockers for site extensibility.
1. Every "new site support addition" needs new columns to be added to **USER** database table (NEWSITE_handle and NEWSITE_lr) for add…
-
Currently, the script is written as a CLI (command-line interface).
If you want to try the script, open a terminal and run it as follows:
`python arg_clawer.py -url http://xxx…`
-
### Please describe your feature request:
Things like Angular, React, etc.
### Describe the use case of this feature:
-
Hi @marevol
I have checked that FESS respects Disallow in robots.txt, but I am unable to verify Crawl-delay and Request-rate. Can you please confirm whether they are implemented?
https://www.promptcloud.com/blo…
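
For reference, this is what the two directives in question look like when a parser does honour them. The sketch uses Python's standard `urllib.robotparser`, not FESS, so it says nothing about how FESS behaves; the robots.txt content and the `ExampleBot` agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using all three directives from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Request-rate: 1/10
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

agent = "ExampleBot"  # assumed crawler name

# Disallow: may this agent fetch the given URL?
allowed = parser.can_fetch(agent, "https://example.com/private/page.html")

# Crawl-delay: seconds to wait between requests (None if absent).
delay = parser.crawl_delay(agent)

# Request-rate: a named tuple (requests, seconds), or None if absent.
rate = parser.request_rate(agent)

print(allowed, delay, rate.requests, rate.seconds)
```

One way to verify a crawler from the outside would be to serve a robots.txt like this from a test site and compare the request timestamps in the access log against the declared delay.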
-
Path `ocse-core/coast_to_coast/coast_to_coast/spiders/robots_txt.py`
This spider should take a URL (e.g. https://example.com) and go to its `robots.txt` file (e.g. https://example.com/robots.txt). …
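
The URL step described above can be sketched with the standard library alone. This is a minimal illustration, not the spider itself: the helper name `robots_txt_url` is hypothetical, and it assumes the input is an absolute URL with a scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url):
    """Build the robots.txt URL for the site that `site_url` belongs to."""
    parts = urlsplit(site_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an absolute URL, got %r" % site_url)
    # robots.txt lives at the root of the host, so any path, query,
    # or fragment on the input URL is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com"))
print(robots_txt_url("https://example.com/some/page?x=1"))
```

Both calls resolve to the same robots.txt location, so the spider could accept any page URL on a site as its starting point.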
-
I am in the process of researching archiving tools and techniques for an investigation tool. It's amazing how many different tools there are and how scattered they are.
Plain static archiving is out of the questio…
-
* **I'm submitting a ...**
[ ] bug report
[X] feature request
[ ] question about the decisions made in the repository
[ ] question about how to use this project
* **Summary**
While we already …
-
**The Problem:**
The PathFinder/PathTracker components responsible for building the "path" navigation across web links from page to page starting from the "root site URL" (rootPath) have two issues:
…