-
Application: [v5 Notary Allocator Application: Open Public Dataset Pathway](https://github.com/filecoin-project/notary-governance/issues/996)
Latest compliance report: [Compliance Report - 2024-10-14…
-
The library used by Tika already spots Welsh, but needs to be [taught](https://github.com/optimaize/language-detector#how-you-can-help) to spot [Scots Gaelic (gd)](https://en.wikipedia.org/wiki/Scotti…
-
Hi,
I am currently working on a machine learning project.
I decided to use the newspaper3k library to get articles by date.
I use cnn.com, nytimes.com, and fox.com to get articles.
However, they usual…
-
- uid: unsupervised_cross_lingual_representation_learning_at_scale
- type: processed
- description:
- name: Unsupervised Cross-lingual Representation Learning at Scale
- description: This pap…
-
#### URL of the results:
https://uidemo.commonsearch.org/?g=en&q=facebook
#### Describe the issue precisely:
Not sure why. Other homepages don't appear because they redirect `/` to something else (l…
-
Hello! I am a grad student and my research deals with networking. I just discovered this repo and FireHOL, and I think this is an awesome resource. I was disappointed to learn that Git and GitHub have…
-
Brave bundles ~500 domains that we think users might want autocompleted, so that some URL bar entries can be autocompleted for the user without a network request.
https://github.com/brave/brave-core/pu…
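
For illustration, local autocompletion against a bundled list could look roughly like the sketch below. This is not Brave's implementation: the `BUNDLED_DOMAINS` contents and the `autocomplete` helper are hypothetical, and the sketch assumes a simple "first domain in sorted order that starts with the typed prefix" rule.

```python
from bisect import bisect_left

# Hypothetical stand-in for the bundled list (the real list is not shown here).
# Kept sorted so that a prefix lookup is a single binary search.
BUNDLED_DOMAINS = sorted([
    "example.com",
    "example.org",
    "github.com",
    "wikipedia.org",
])


def autocomplete(prefix, domains=BUNDLED_DOMAINS):
    """Return the first bundled domain that starts with `prefix`, or None.

    The lookup is purely local: no network request is made.
    """
    if not prefix:
        return None
    prefix = prefix.lower()
    # In a sorted list, the first entry >= prefix is the only candidate
    # that can be the smallest domain starting with that prefix.
    i = bisect_left(domains, prefix)
    if i < len(domains) and domains[i].startswith(prefix):
        return domains[i]
    return None
```

With this list, typing `git` would complete to `github.com`, while a prefix that matches nothing returns `None` and the URL bar would fall back to its other suggestion sources.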
-
On the website (https://data.commoncrawl.org/contrib/datacomp/index.html) only the full 280TB pool is available. Furthermore, https://github.com/mlfoundations/dclm/tree/main/exp_data/datasets/ has sev…
-
In addition to CDXJ, the [ZipNum format](https://github.com/ikreymer/pywb/wiki/CDX-Index-Format#zipnum-sharded-cdx) uses a secondary index, which also includes a sortable url key but contains other da…
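
As a rough sketch of why a sortable url key in a secondary index is useful: a reader can binary-search the secondary index to pick the block of the main index that may contain a given key, and then read only that block. The entry layout below (`first key, shard, offset, length`) and the `find_block` helper are assumptions made for this example, not the exact ZipNum field layout.

```python
from bisect import bisect_right

# Hypothetical secondary index entries, sorted by url key:
# (first url key in the block, shard file, byte offset, byte length)
SECONDARY_INDEX = [
    ("com,example)/", "cdx-00000.gz", 0, 1000),
    ("org,example)/", "cdx-00000.gz", 1000, 1200),
    ("org,wikipedia)/", "cdx-00001.gz", 0, 900),
]


def find_block(url_key, index=SECONDARY_INDEX):
    """Return the entry for the block that may contain `url_key`, or None.

    The candidate is the last block whose first key is <= url_key.
    A real reader may also need to scan neighbouring blocks, since
    captures of one url can span a block boundary.
    """
    keys = [entry[0] for entry in index]
    i = bisect_right(keys, url_key) - 1
    if i < 0:
        # url_key sorts before the first block, so it cannot be indexed.
        return None
    return index[i]
```

A caller would then seek to the returned offset in the shard file, decompress that block, and scan it for the key.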
-
## List of candidate packages
- Mozilla's [Readability.js](https://github.com/mozilla/readability) Node.js package. Some Python wrappers exist, but they seem to be based on an older version of Rea…