chfoo/warcat
Tool and library for handling Web ARChive (WARC) files.
GNU General Public License v3.0
150 stars, 21 forks
issues
#25 Add 'warcat' console_scripts entry point; also ignore *.egg-info (dlitz, opened 2 years ago, 0 comments)
#24 http.client.BadStatusLine: http/1.1 200 OK (chris-aeviator, opened 3 years ago, 0 comments)
#23 Add force_gzip flag to WARC.load to fix #20 (acrois, closed 3 years ago, 0 comments)
#22 No mention of 'resource' in list at verify_refers_to (RvanVeenendaal, opened 4 years ago, 2 comments)
#21 [Merged OK] Add target uri filter (JesseWeinstein, closed 4 years ago, 1 comment)
#20 pass on warc.gz error (marked, closed 3 years ago, 1 comment)
#19 Malformed HTTP headers lead to "ValueError: need more than 1 value to unpack" crash (JustAnotherArchivist, opened 5 years ago, 1 comment)
#18 wpull WARCs cause "Content block length changed from X to Y" warnings on warcinfo record (JustAnotherArchivist, opened 5 years ago, 0 comments)
#17 Use errors='replace' when decoding HTTP headers (Frogging101, closed 7 years ago, 1 comment)
#16 Handling for "files" that are purely in memory? (spott, opened 7 years ago, 2 comments)
#15 Support payload digest of revisit records (Arkiver2, opened 8 years ago, 1 comment)
#14 Add easy way to iterate over warc records (sirex, opened 8 years ago, 3 comments)
#13 URL agnostic deduplication of WARC (Arkiver2, opened 8 years ago, 0 comments)
#12 'utf-8' codec can't decode byte invalid continuation byte (fanchyna, closed 7 years ago, 1 comment)
#11 A name to a file object is not handled correctly (chfoo, opened 8 years ago, 0 comments)
#10 Reading in an in-memory gzip.GzipFile object breaks warcat.model.binary.BinaryFileRef objects (d-m, closed 8 years ago, 3 comments)
#9 Extract performance is extremely slow on megawarcs (gwern, opened 8 years ago, 1 comment)
#8 Feature: extract only files matching a regexp (gwern, opened 8 years ago, 0 comments)
#7 Feature: extract WARCs specified with index/length (gwern, opened 8 years ago, 1 comment)
#6 http.client.IncompleteRead crash during extract (chfoo, closed 10 years ago, 1 comment)
#5 Handle long filenames (chfoo, closed 10 years ago, 1 comment)
#4 Support warnings when Content-Type doesn't match what cdx-writer expects (chfoo, closed 10 years ago, 1 comment)
#3 Support warnings when WARC field name casing don't match hanzo's warc-tools. (chfoo, opened 10 years ago, 1 comment)
#2 Support older Python 2.7 (chfoo, opened 10 years ago, 2 comments)
#1 Fields with empty values in metadata records increases block length (chfoo, closed 10 years ago, 0 comments)