-
https://github.com/subfinder/subfinder/tree/master/libsubfinder/sources/commoncrawl
-
I'd like to download all pages from the www.ipc.com domain as a WARC archive file (or several files), so I do the following:
```
$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.…
-
**CrateDB version**:
3.0.5 (but it seems any > ~~2.0.6~~ 2.3.11)
**Environment description**:
- Official Crate Docker Image, e.g. crate:3.0.5
- 2-nodes local swarm (tried with 3 also, but 2 …
-
I am running the sample word count script on EMR with cluster ID `j-XXXXXX` as follows:
`python word_count.py README.rst -r emr --cluster-id j-XXXXXX`
It fails with the following error message:
…
-
I am training a transformer model on en-fr data. I have run it several times, but each time it seems to deadlock after finishing a batch. The log is as follows:
[2018-09-19 20:47:48] Training started
[2018-09-19 2…
-
See https://www.parse.ly/help/integration/ppage/. This is used less often now than the Open Graph protocol, but it was far more common in the past. As a result, this provides value to anyone using this …
-
When reading a Common Crawl WARC file (e.g. crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz), when iterating to the second record, in clea…
-
```
ERROR:newsplease.crawler.commoncrawl_extractor:Document is empty
2018-06-16 14:23:49 [newsplease.crawler.commoncrawl_extractor] ERROR: Document is empty
Traceback (most recent call last):
Fi…
-
Hello!
I enjoy using your library and pretrained vectors. I see that for vectors that were trained on wiki you provide both binary model and pretrained vectors. However, for vectors that were trained…