-
Hello All,
I am trying to run SentAugment as part of my project for clustering purposes, but I am facing multiple issues running it. I am using part of the CommonCrawl data for this purpose.
…
-
### Description
When I tried to run the EN-DE translation example, I got this error:
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 'http' not impleme…
-
While studying data pipelines, I found CCNet. CCNet is very intriguing to me. I'm going to use CCNet to create a better data pipeline for Korean datasets.
I have a question. In the paper, it is state…
-
Hello all,
over on [Flair](https://github.com/zalandoresearch/flair) we noted that the Japanese Wikipedia Embeddings may not be meaningful (see Issue #[336](https://github.com/zalandoresearch/flai…
-
## Is your feature request related to a problem? Please describe.
I am working with filtered downloads of the Common Crawl dataset (~100TB, with plans to grow to ~200TB), so auto-indexing all collect…
-
For those still looking for a (team) project idea, HyperLogLog is an interesting probabilistic data structure that is worth studying.
https://chengweihu.com/hyperloglog/
It would benefit from a …
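As a starting point for anyone picking this up, here is a minimal sketch of the HyperLogLog idea in Python. It is an illustration under stated assumptions, not a reference implementation: the class name `HyperLogLog`, the use of the first 8 bytes of SHA-1 as a 64-bit hash, and the default of `p=12` index bits are all choices made for this example, and only the standard small-range (linear counting) correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative sketch)."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers.
        # The alpha constant used below is the usual approximation for m >= 128.
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash;
        # any well-mixed hash function would work here.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select a register.
        idx = h >> (64 - self.p)
        # The remaining 64 - p bits determine the rank.
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        # Raw estimate: alpha * m^2 divided by the sum of 2^-register.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting
        # while some registers are still empty.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same item twice leaves the registers unchanged, so duplicates do not affect the estimate. The typical relative error of HyperLogLog is about 1.04 / sqrt(m), so more registers trade memory for accuracy.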
-
#### URL of the results:
https://uidemo.commonsearch.org/?g=fr&q=gouv.fr
#### Describe the issue precisely:
Almost all French government websites are missing.
Some examples:
http://www.nord.gouv.fr/
http…
-
* Readme: "These 83 languages are detected"
* python (import cld2; help(cld2)): 161 languages (175 language-script combinations), 240 total language-script combinations
* python (import cld2; print(…
-
For large language pairs with about 1.2 million candidate pairs, this script takes days to run. While in this case 2.4 million web pages get downloaded and processed, it would still be useful to determ…
-
When I try to give the input path (below) to Hadoop, I get this error: "Error: org.jets3t.service.impl.rest.httpclient.RestS3Service.(Lorg/jets3t/service/security/AWSCredentials;)V"
Input p…