PiRSquared17 / warc-tools

Automatically exported from code.google.com/p/warc-tools

SRS 65 — It shall be possible for libwarc to handle WARC files of any size, with minimal memory usage. #72

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
SRS 65 — It shall be possible for libwarc to handle WARC files of any size, with minimal memory usage.

Original issue reported on code.google.com by gordon.p...@gmail.com on 27 Jul 2008 at 10:23

GoogleCodeExporter commented 9 years ago
warcdump -- speed & memory use:

Quick tests comparing the speed and memory use of warcdump on a 1.1GB WARC file:

Command "time cat five.warc > /dev/null" 
Result: takes about 20 seconds / 2.9K memory (measured by top).

Command "time ../warc-tools-read-only/app/warcdump -f five.warc > /dev/null"
Result: takes about 2.6 seconds / 1.8K memory (measured by top).

Memory use appears to be constant regardless of file size.

arc2warc:

Quick tests on arc2warc show a constant amount of memory used even for larger ARC files (a 1GB file converted over about 1:11 uses the same memory throughout, less than 2K in top). The same goes for a 5.5GB file.

Original comment by gordon.p...@gmail.com on 7 Nov 2008 at 3:32
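
The constant memory footprint reported above is what record-at-a-time streaming with a fixed-size buffer would give you. The sketch below (plain C stdio, purely illustrative, not libwarc's actual API) shows the pattern: a file of any size is processed while the working set stays bounded by the buffer size.

#include <stdio.h>
#include <stdlib.h>

/* Fixed-size buffer: memory use stays at BUF_SIZE no matter how large
 * the input file is, which matches the behaviour measured with top. */
#define BUF_SIZE (64 * 1024)

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <warc-file>\n", argv[0]);
        return EXIT_FAILURE;
    }

    FILE *in = fopen(argv[1], "rb");
    if (in == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    static unsigned char buf[BUF_SIZE];
    unsigned long long total = 0;
    size_t n;

    /* Process the file chunk by chunk, reusing the same buffer. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        total += n;

    fclose(in);
    printf("read %llu bytes with a %d-byte buffer\n", total, BUF_SIZE);
    return EXIT_SUCCESS;
}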

GoogleCodeExporter commented 9 years ago
Tried to convert a 5.5GB ARC file to WARC with the command:

../warc-tools-read-only/app/arc2warc -a six.arc.gz -f siz.warc

but the command halted with this error:

> debug: lib/private/wfile.c :1971:"couldn't add record to the warc file, maximum size reached"

Original comment by gordon.p...@gmail.com on 7 Nov 2008 at 3:39

GoogleCodeExporter commented 9 years ago
Younes -- could you let us know what the maximum size is for this command, and whether it applies to others?

Original comment by gordon.p...@gmail.com on 7 Nov 2008 at 3:40

GoogleCodeExporter commented 9 years ago
Reopened, discussing with Younes as there is a maximum WARC size set in the code.

Younes wrote:

Actually, we're using a warc_u32_t (i.e. a 32-bit unsigned int, maximum value 4,294,967,295, about 4GB of length) to handle the WARC file.
We thought that was a good strategy to make you think twice and avoid having big WARCs. Something between 100 MB and 600 MB (or even 1GB) is a good choice in my opinion (it minimizes the risk of data loss, keeps data copying pretty fast, ...).
This is what IA, Hanzo and others generally use.

In the file "app/arc2warc.c", you can find a 32-bit integer constant called:

#define WARC_MAX_SIZE 1629145600

This is actually your limit. You can increase this value up to 4GB and try again:

#define WARC_MAX_SIZE 4294967296

Original comment by gordon.p...@gmail.com on 12 Nov 2008 at 12:07
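
For context on the numbers above: assuming warc_u32_t is a plain 32-bit unsigned integer, the largest representable length is 4,294,967,295 bytes (just under 4GB); the suggested value 4294967296 is 2^32 and does not fit in 32 bits, so the practical ceiling is one byte lower. A small illustrative check (not libwarc code):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t max_len = UINT32_MAX;               /* 4,294,967,295 bytes, ~4GB */
    uint32_t wrapped = (uint32_t)4294967296ULL;  /* 2^32 truncated to 32 bits */

    printf("UINT32_MAX        = %" PRIu32 " bytes (~%.2f GB)\n",
           max_len, max_len / (1024.0 * 1024.0 * 1024.0));
    printf("4294967296 as u32 = %" PRIu32 "\n", wrapped);  /* prints 0 */
    return 0;
}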

GoogleCodeExporter commented 9 years ago
From Younes' new release message today:

* Support for large WARC files up to 18 exabytes in size.
* Support for large WARC records up to 18 exabytes each.

Original comment by gordon.p...@gmail.com on 17 Nov 2008 at 11:00
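
The 18-exabyte figure is consistent with moving to a 64-bit unsigned length (2^64 - 1 bytes is roughly 18.4 EB). A quick arithmetic check, purely illustrative:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t max_len = UINT64_MAX;     /* 18,446,744,073,709,551,615 bytes */

    /* About 18.4 exabytes (decimal), hence "up to 18 EB" in the release note. */
    printf("UINT64_MAX = %" PRIu64 " bytes (~%.1f EB)\n",
           max_len, max_len / 1e18);
    return 0;
}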

GoogleCodeExporter commented 9 years ago
Repeated the test on the 5.5GB ARC:

../warc-tools-read-only/app/arc2warc -a six.arc.gz -f six.warc

This works fine; the output file is as follows:
-rw-r--r-- 1 paynter paynter 6495613350 2008-11-18 12:14 six.warc

Original comment by gordon.p...@gmail.com on 18 Nov 2008 at 12:40