warc Search Results - Githubissues

1000+ results for warc

lintool/warcbase #260

WARCRecord NotSerializableException when trying to get rid o…

I try to get rid of duplicate pages as follows: `val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc) .keepValidPages() .groupBy(_.getUrl).values.map(_.head) // remove dup…`

dportabella updated 7 years ago · 1 comment

go-shiori/shiori #353

Support Obelisk archiving

It seems that shiori depends on [warc](https://github.com/go-shiori/warc) which is currently archived. We need to find a replacement for warc. Maybe [obelisk](https://github.com/go-shiori/obelisk)? #…

fmartingr updated 8 months ago · 12 comments

facebookresearch/cc_net #45

Numerous Errors

Hello, Thank you for all of your great work. I am trying to just download and process the English dumps from CommonCrawl up to 2023. I have been running into multiple errors. It seems as if the …

conceptofmind updated 1 year ago · 2 comments

WASAPI-Community/data-transfer-apis #3

Suggestions for /webdata endpoint, support for 'open' warcs

I wanted to offer some thoughts on the /webdata endpoint in general and some possible areas of improvement for supporting other services, such as Webrecorder. One issue that I see is the time for h…

ikreymer updated 7 years ago · 3 comments

ArchiveTeam/youtube-dislikes-grab #4

How to decompress zst files?

I'm currently trying to decompress [this archive file](https://archive.org/download/archiveteam_youtubedislikes_20211211123803_62132b09/youtubedislikes_20211211123803_62132b09.1638107855.megawarc.warc…

Myzel394 updated 2 years ago · 5 comments

webrecorder/pywb #182

Index WARC files on external storage

Being able to index and re-index collections that are located on remote storage (S3) would be very helpful.

despens updated 4 years ago · 2 comments

webrecorder/pywb #217

pywb timeouts on larger WARC file

We use pywb to serve a ~ 488 MB WARC file (https://webrecorder.io/layoutanalysis/2015_2016) to [scrapy](https://scrapy.org/)/[splash](http://splash.readthedocs.io/en/stable/), which injects a layout a…

fbuchinger updated 7 years ago · 2 comments

commoncrawl/news-crawl #34

Add HTTP protocol version to HTTP request message

The request records in the CC-NEWS WARC files lack the HTTP protocol version: `GET /path` instead of `GET /path HTTP/1.1`. This makes some WARC parsers fail to process the WARC fil…

sebastian-nagel updated 4 years ago · 1 comment

oduwsdl/ipwb #374

Does ipwb handle segmented response records?

The [WARC/1.1](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1-1_latestdraft.pdf) spec (Section B.8) gives an example where a response record is segmented into multiple other smaller records. This c…

machawk1 updated 6 years ago · 5 comments

ukwa/webrender-api #9

Replace with browsertrix-crawler

Rather than our own `webrender-api`, consider switching to https://github.com/webrecorder/browsertrix-crawler The integration pattern is somewhat different to Browsertrix's primary use case, but it…

anjackson updated 3 years ago · 2 comments
