-
Given the features of commit 514754b4c40517fce08236202a0a9830353368bf and the idea that this map might be taking in data from the US Census Bureau, FBI (#29), and National Telecommunications and …
-
Many other domains were found that are owned by Admiral and point to the same IP as #1. There's a list at https://pgl.yoyo.org/adservers/admiral-domains.txt
-
This proposes some tooling for large datasets. **Warning!** As soon as I wrote it, I already wanted to change it. In particular, I want to change the `db` thing to just be a normal ipfs repo. It would h…
-
We have difficulties with our Do Not Sell link identification.
1. We have [difficulties improving the accuracy of the Do Not Sell link identification](https://github.com/privacy-tech-lab/gpc-web-craw…
-
While browsing the users list, I noticed that there are some users that were created just for spam, like this one: https://tatoeba.org/ita/user/profile/mingletrain
Maybe the code should be refined to…
-
I am getting that, and it seems others are too:
https://github.com/gigablast/open-source-search-engine/issues/199#issue-1550056077
-
We need a small stand-alone web UI that ties in with the REST components in #24 to visualize the data generated by the cluster. You should also be able to submit API requests to the cluster.
Preferab…
-
Hi Pascal,
I have a question regarding crawling through an internal SharePoint site. It seems that every time I go through the internal links I get a 403 Forbidden, although I have set up the login a…
-
When I try to crawl https://www.cnn.com/2024/10/05/weather/tropical-storm-milton-florida-gulf-of-mexico/index.html, the crawler gets stuck with this output until I exit via Ctrl + C.
```
#### …
-
### not all are musts
**for discussion (some will be post 1 June development)**
- [x] anything tagged (sg comments?)
- [x] scaling of orange markers based on some feature density measure
**home page…