warc Search Results - Githubissues

1000+ results for warc


oduwsdl/ipwb #790

Replay arbitrary WARCs through subdomain/subpath CID inclusi…

When looking at this project, I saw that dynamically linking to an "archive" of a website via URLs, if/after I set up IPWB on a subdomain or a website, is not really possible, as IPWB sets itself up t…

ShadowJonathan updated 1 year ago
4
bigscience-workshop/data_tooling #298

Crawling curated list of sites: BigScience catalog app URLs

We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names. This issue tracks domain names identified in the [**BigScience Data Cataloging Ev…

yjernite updated 2 years ago
2
lintool/warcbase #140

Re-upgrade Guava (UKWA's WARC Hadoop indexer dependency)

I had to downgrade Guava to accommodate UKWA's WARC Hadoop indexer: https://github.com/lintool/warcbase/blob/master/pom.xml#L277 But this issue now appears to be fixed: https://github.com/ukwa/webarch…

lintool updated 9 years ago
2
internetarchive/warcprox #101

concurrency bug when running with multiple warc writer threa…

In July @vbanos reported invalid gzip data in a warc written by warcprox with `--writer-threads=5`. My benchmarking suggests that 1 writer thread is optimal: https://github.com/internetarchive/war…

nlevitt updated 6 years ago
2
marked/yahoo-group-archiver #2

Support single file archive

.zip or .tgz or .warc

marked updated 5 years ago
1
machawk1/warcreate #31

Provide ability to create WARCs that are tar.gz

As suggested by Noah Levitt @ internet archive.

machawk1 updated 7 years ago
2
cheng10/WARC-Portal #41

MemoryError when calculating tf_idf with large warc file

![image](https://cloud.githubusercontent.com/assets/10646050/20856432/c334272e-b8cb-11e6-8dfa-cbd0a5e92e32.png)

cheng10 updated 7 years ago
3
eugeneware/warc #1

Unable to parse ClueWeb09

Hi, I'm trying to read the ClueWeb09 warc file, but no data is emitted and no error is raised. It seems that ClueWeb's separator is different from that of standard warc files; I have forked this repository for [chan…

dod91 updated 8 years ago
1
AlexGustafsson/larch #16

Payloads in compressed WARCs are not lazily readable

Although WARCs created by Larch have support for streams, we're currently unable to use it for the server. The issue is this: 1. We use a ReadSeeker to be able to scrub in a stream 2. If the arc…

AlexGustafsson updated 3 years ago
2
VIDA-NYU/ache #148

Separate out downloaded pages into different (warc) files

Is there a config option for splitting out downloaded files into their own warc files instead of going into the same one? This would allow for easier data extraction based on individual items.

DanAbbz92 updated 6 years ago
3
