warc Search Results - Githubissues

1000+ results
for warc

stanfordnlp/GloVe #74

The Common Crawl models: What data sets are they based on?

Hello! I am writing my master's thesis on constructing corpora from the web. In it, I use the Common Crawl's WARC files to produce my corpus by extracting text and cleaning it myself. The GloVe …

kjetilbk updated 6 years ago
3
ukwa/webarchive-discovery #252

Clean up temporary files underway

It seems that calling `warc-indexer` with thousands of WARC files causes the `tmp` folder to fill up (maybe due to DROID temporary files). It should be possible to clean up underway.

tokee updated 1 year ago
5
iipc/warc-specifications #38

Content-Type grammar inconsistent with examples

From WARC 1.1 section 5.6: > (or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively) Note the space after the semicolon. However the grammar immediately foll…

ato updated 2 years ago
1
iipc/openwayback #389

Problem playing back captured Instagram pages

I have captured Instagram pages with Squidwarc, and playing them back in openwayback (latest Docker) triggers a script that seems to put the UI in an endless loop of reloads: ![openwayback_instag…

peterk updated 5 years ago
4
killercup/static-filez #9

Define the archive format

Let's define the format of our archives. ## Current state A binary file that is actually just concatenated gzip blobs. Features: 1. Extract gzip files 2. Append is trivial ## Prior art…

killercup updated 4 years ago
10
bitextor/warc2text #20

warc2text tries to read large warc records into memory, caus…

When trying to warc2text this massive (29G) file on cirrus `/beegfs/paracrawl/data/ia/wide00015-warcs/WIDE-20170107025349-crawl808/WIDE-20170107025349-02068.warc.gz` it is killed by the OOM killer. …

jelmervdl updated 3 years ago
1
elastic/elasticsearch #104833

Enable configuration of additional types in attachment proce…

### Description When the ingest-attachment processor was first upgraded, and then packaged for use in Search cases, there was an intentional limiting of the file types supported. Now that we have s…

serenachou updated 9 months ago
1
whatwg/compression #42

Support for decompressing multi-member gzip files?

The GZIP spec includes support for one or more members `(A gzip file consists of a series of "members" (compressed data sets).` but this spec currently states `A gzip stream may only contain one "me…

ikreymer updated 1 month ago
2
ArchiveTeam/tumblr-grab #19

Repeat crawl of same link

I could be reading wrong but this looks like the same post was crawled 3x. ``` $ zcat */tumblr-tumblr-blog_9volt-art-20181210-233052.warc.gz | strings | egrep -A10 '^WARC-Target-URI: http://9vo…

marked updated 5 years ago
3
webrecorder/warcio #143

Documentation: Clarify that capture_http writer with filenam…

I'm using Python 3.10.4 and warcio 1.7.4 Using a piece of code based on https://github.com/webrecorder/warcio#writing-warc-records, I'm getting ``` for record in ArchiveIterator(writer.get…

voltagex updated 2 years ago
3
