Separate out downloaded pages into different (warc) files

DanAbbz92 commented 6 years ago

Is there a config option for writing each downloaded page to its own WARC file instead of putting them all into the same one?

This would allow for easier data extraction based on individual items.

aecio commented 6 years ago

Do you mean storing the WARC record of each URL in its own file? No, there is no option for that. But you could try setting the maximum size (in bytes) of each WARC file using:

target_storage.data_format.warc.max_file_size: 262144000

Setting a small enough size would force 1 page per WARC file. That being said, I wouldn't recommend this, since creating a very large number of small files may cause file system problems on large crawls.
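For reference, the value above is 250 MiB expressed in bytes. A minimal sketch of the conversion, in case you want to compute a different limit (the helper name `mib_to_bytes` is made up for illustration and is not part of ACHE):

```python
def mib_to_bytes(mib: float) -> int:
    """Convert a size in mebibytes (MiB) to a whole number of bytes."""
    return int(mib * 1024 * 1024)


# The limit from the config line above: 250 MiB.
print(mib_to_bytes(250))  # 262144000

# A hypothetical, much smaller limit of 1 MiB.
print(mib_to_bytes(1))  # 1048576
```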

Another option is to use the FILESYSTEM data formats. They do create one file per URL, but they don't support the WARC format yet.
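A further possibility, outside of ACHE itself, is to post-process a finished WARC file and write each record to its own file. The following is only a rough sketch under some assumptions: the input is an uncompressed `.warc` file (a gzip-compressed one would need to be decompressed first, or read with a library such as warcio), every record carries a `Content-Length` header, and the function names and the output naming scheme are made up for illustration.

```python
import io
from pathlib import Path
from typing import BinaryIO, Iterator, List


def iter_warc_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each raw record (header, body, trailing CRLFs) from an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        header = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header.append(line)
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")
        body = stream.read(length)
        if len(body) != length:
            raise ValueError("truncated record body")
        # Re-add the two CRLFs that terminate a record.
        yield b"".join(header) + body + b"\r\n\r\n"


def split_warc(src: Path, out_dir: Path) -> List[Path]:
    """Write every record of `src` to its own file in `out_dir` and return the new paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with src.open("rb") as stream:
        for index, record in enumerate(iter_warc_records(stream)):
            target = out_dir / f"{src.stem}-{index:06d}.warc"
            target.write_bytes(record)
            written.append(target)
    return written


# Tiny in-memory demo with two hand-made records.
sample = (
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
print(len(list(iter_warc_records(io.BytesIO(sample)))))
```

Note that this splits per WARC record, not per page, so any non-response records in the file (for example a `warcinfo` record) would also end up in files of their own. It also produces the same large number of small files as a tiny `max_file_size` would, so the file system caveat above still applies.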

DanAbbz92 commented 6 years ago

Ah, thanks for the update @aecio, and yes, a WARC per relevant URL.

Does that mean the FILESYSTEM data format is planned to support WARC files in the future?

aecio commented 6 years ago

No, it is not planned, but it could be included.

VIDA-NYU / ache

Separate out downloaded pages into different (warc) files #148