-
Hello!
I am writing my master's thesis on constructing corpora from the web. In it, I use the Common Crawl's WARC files to produce my corpus by extracting text and cleaning it myself.
The GloVe …
-
It seems that calling `warc-indexer` with thousands of WARC files causes the `tmp` folder to fill up (maybe due to DROID temporary files). It should be possible to clean up along the way.
-
From WARC 1.1 section 5.6:
> (or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively)
Note the space after the semicolon. However, the grammar immediately foll…
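
For readers who need to accept both spellings, here is a minimal sketch of a whitespace-tolerant comparison. The helper name `parse_media_type` is hypothetical and is not part of the WARC specification or of any WARC library; it only illustrates why the optional space matters when comparing `Content-Type` values.

```python
from typing import Dict, Tuple


def parse_media_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split a media type into (type, parameters), ignoring optional spaces.

    Hypothetical helper for illustration only; it does not handle
    quoted parameter values that contain semicolons.
    """
    main, *params = value.split(";")
    parsed = {}
    for param in params:
        key, _, val = param.partition("=")
        parsed[key.strip().lower()] = val.strip()
    return main.strip().lower(), parsed


# Both spellings normalise to the same result.
with_space = parse_media_type("application/http; msgtype=request")
without_space = parse_media_type("application/http;msgtype=request")
```

Comparing the parsed form rather than the raw string keeps a reader from rejecting records that differ only in that one space.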
-
I have captured Instagram pages with Squidwarc, and playing them back in openwayback (latest Docker) triggers a script that seems to put the UI in an endless loop of reloads:
![openwayback_instag…
-
Let's define the format of our archives.
## Current state
A binary file that is actually just concatenated gzip blobs.
Features:
1. Extract gzip files
2. Append is trivial
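
The "concatenated gzip blobs" layout can be sketched with Python's standard `gzip` module. This is an illustration under assumptions, not the project's actual code: the helper `append_blob` and the in-memory `archive` buffer are hypothetical stand-ins for the real archive file.

```python
import gzip
import io


def append_blob(archive: io.BytesIO, payload: bytes) -> None:
    """Append one payload as its own gzip member at the end of the archive."""
    archive.seek(0, io.SEEK_END)
    archive.write(gzip.compress(payload))


# Hypothetical in-memory archive; a real one would be a file opened in "ab" mode.
archive = io.BytesIO()
append_blob(archive, b"first entry\n")
append_blob(archive, b"second entry\n")

# gzip.decompress reads every member in sequence and joins the output.
combined = gzip.decompress(archive.getvalue())
```

Appending never rewrites existing bytes, which is why it is trivial. The cost is that member boundaries are not recorded anywhere, so finding a single entry means scanning from the start unless offsets are kept separately.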
## Prior art…
-
When trying to run warc2text on this massive (29G) file on cirrus, `/beegfs/paracrawl/data/ia/wide00015-warcs/WIDE-20170107025349-crawl808/WIDE-20170107025349-02068.warc.gz`, it is killed by the OOM killer.
…
-
### Description
When the ingest-attachment processor was first upgraded, and then packaged for use in Search cases, there was an intentional limiting of the file types supported. Now that we have s…
-
The GZIP spec includes support for one or more members: `A gzip file consists of a series of "members" (compressed data sets).`
but this spec currently states `A gzip stream may only contain one "me…
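
To make the multi-member case concrete, the following sketch splits a gzip byte string into its individual members using the standard `zlib` module. The function name `split_members` is hypothetical. It assumes the input contains only well-formed members with no trailing padding.

```python
import gzip
import zlib
from typing import List


def split_members(data: bytes) -> List[bytes]:
    """Return the decompressed payload of each gzip member in `data`."""
    members = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decoder = zlib.decompressobj(wbits=31)
        members.append(decoder.decompress(data) + decoder.flush())
        # Bytes after the end of this member belong to the next one.
        data = decoder.unused_data
    return members


two_members = gzip.compress(b"record one") + gzip.compress(b"record two")
payloads = split_members(two_members)
```

A reader that stops after the first member would silently drop `record two` here, which is the practical risk of a single-member restriction.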
-
I could be reading this wrong, but it looks like the same post was crawled 3x.
```
$ zcat */tumblr-tumblr-blog_9volt-art-20181210-233052.warc.gz | strings | egrep -A10 '^WARC-Target-
URI: http://9vo…
-
I'm using Python 3.10.4 and warcio 1.7.4
Using a piece of code based on https://github.com/webrecorder/warcio#writing-warc-records, I'm getting
```
for record in ArchiveIterator(writer.get…