-
CC-NEWS contains the literal values of the HTTP header fields `Content-Encoding`, `Transfer-Encoding` and `Content-Length`, although the payload is stored unchunked and uncompressed.
- the header …
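
A minimal sketch of how a consumer might detect the mismatch described above, assuming the HTTP headers of a record are available as a plain dict and the payload as already-decoded bytes. The function name `stale_http_headers` is hypothetical and not part of any crawler or WARC library.

```python
def stale_http_headers(headers, payload):
    """Return the header names that no longer describe the stored payload.

    Assumption: `payload` is stored unchunked and uncompressed, so any
    non-identity Content-Encoding or Transfer-Encoding value is stale.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    stale = []
    for name in ("content-encoding", "transfer-encoding"):
        value = lowered.get(name, "").strip().lower()
        if value and value != "identity":
            stale.append(name)
    declared = lowered.get("content-length")
    # Content-Length is stale when it disagrees with the stored byte count.
    if declared is not None and declared.strip().isdigit():
        if int(declared) != len(payload):
            stale.append("content-length")
    return stale


if __name__ == "__main__":
    headers = {"Content-Encoding": "gzip", "Content-Length": "20"}
    print(stale_http_headers(headers, b"<html></html>"))
```

A reader that trusts `Content-Length` or tries to gunzip the body based on `Content-Encoding` would fail on such a record, which is why a check like this can be useful before parsing.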
-
The WARC standard recommends marking records that have been truncated because of limits on content size or fetch time with the field [WARC-Truncated](http://iipc.github.io/warc-specifications/specif…
-
CC-NEWS response records should include the remote target IP address in the field "WARC-IP-Address". Thanks, @wumpus, for detecting this!
-
**In raising this issue, I confirm the following (please check boxes, eg [X]) Failure to fill the template will close your issue:**
- [X] I have read and understood the [contributors guide](https:/…
-
Hi
I am looking for model specifications, such as how much data was used to train components like the NER, tagger and parser. For instance, how much raw data was the NER in the sm/md/lg English models trained on? …
-
Reported on decksite by decksite-perf
--------------------------------------------------------------------------------
Request Data
```
Request Method: GET
Path: /news/?
Cookies: {}
Endpoint: news
V…
-
**Describe the bug**
I cloned the repository and installed all the libraries listed in requirements.txt, plus others such as hurry, and then tried to run newsplease.examples.commoncrawl.
Last erro…
-
Hi, very potentially useful tool!
Running it on a Macbook Pro 2018 over a 3 year time range with 3 sources generates 1 article per second. We want millions of articles that contain a keyword, so we…
-
I am trying to use the commoncrawl.py script to pull news articles from Common Crawl, but am receiving a very persistent HTTPError. I made sure I have aws-cli installed via pip (I am using Conda),…
-
I noticed that for `boto.s3.key.Key` you use `io.BufferedReader`.
boto3's `botocore.response.StreamingBody` has the attribute `_raw_stream` which it looks like you can just pass to `io.BufferedRea…
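
A minimal sketch of the idea above: wrap a raw byte stream in `io.BufferedReader` and read it line by line. The helper name `buffered_lines` is hypothetical, and `io.BytesIO` stands in for the network stream so the example runs without AWS access. With boto3, the object passed in would presumably be the `_raw_stream` attribute of the `StreamingBody`; note that this is a private attribute and may change between botocore versions.

```python
import io


def buffered_lines(raw_stream, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """Yield lines from a readable binary stream through a buffered reader."""
    reader = io.BufferedReader(raw_stream, buffer_size=buffer_size)
    for line in reader:
        yield line


if __name__ == "__main__":
    # Stand-in for a remote object body; with boto3 this might instead be
    #   body = s3_client.get_object(Bucket=..., Key=...)["Body"]
    #   raw = body._raw_stream   # private attribute, an assumption here
    raw = io.BytesIO(b"WARC/1.0\r\nWARC-Type: response\r\n")
    for line in buffered_lines(raw):
        print(line)
```

Buffering matters here because WARC parsers typically read many short header lines, and an unbuffered network stream would turn each of those into a separate small read.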