-
Hi Kaggle team, it would be great to have this Python package available:
https://github.com/webrecorder/warcio
It reads the Web ARChive (WARC) format, which Common Crawl uses to store t…
-
Hello, I am trying to recreate the en_core_web_lg model (https://github.com/explosion/spacy-models/releases//tag/en_core_web_lg-2.2.5) by following the steps in the model description and assets fo…
-
For example, in
`cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de`
is there a table or some other source for what zh_HK, zh_yue, yue, etc. represent?
Is zh_yue different…
-
Note that we are currently integrated via CommonCrawl, but would like to switch to an API integration. Also note that we previously had issues getting correct license information from the page, and as…
-
Hello,
It's more of a suggestion than an issue. I recently installed subfinder and saw ```commoncrawl``` listed as a passive source.
However, subfinder is requesting the following index: …
-
As far as I can tell, https://github.com/bitextor/bitextor/blob/master/bitextor-tokenize.py and https://github.com/bitextor/bitextor/blob/master/bitextor-tokenize-moses.py launch new processes for ever…
-
This namespace element was motivated by a situation where a podcast approached podcastindex, noting that they had _two_ feeds for their podcast (one for most people, another motivated by their Chinese…
-
UPDATE: I took two different approaches to fixing this, please see PR #446 and #447. #446 applies a single mutex in the problematic function. #447 adds a new string set implementation that is inherent…
-
I am trying to write a web server that serves up search requests for an otherwise static website. I am using a fork of Zola that uses tantivy (version 12.0) to create a search index with al…
-
### Description
I tried to download the **wikisum** dataset used in the paper "Generating Wikipedia by Summarizing Long Sequences" and wanted to use my own computer to do it instead of GCP. I executed …