Open DanAbbz92 opened 6 years ago
Do you mean storing the WARC record of each URL in its own file? No, there is no option for that. But you could try setting the maximum size (in bytes) of each WARC file; the value below is 250 MiB:
target_storage.data_format.warc.max_file_size: 262144000
Setting a small enough size would force one page per WARC file. That being said, I wouldn't recommend this, since creating that many small files may cause file system problems on large crawls.
Another option is to use the FILESYSTEM data formats. They do create one file per URL, but they don't support the WARC format yet.
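If one file per record is only needed for extraction after the crawl, a third route is to keep the default multi-record WARC files and split them in post-processing. The following is a minimal sketch of that idea, not something the crawler provides: the function names `iter_warc_records` and `split_warc` and the `record-NNNNNN.warc` naming are hypothetical, and it assumes uncompressed WARC files in which every record carries a `Content-Length` header. Compressed files would need to be decompressed first.

```python
from pathlib import Path


def iter_warc_records(stream):
    """Yield each raw record (headers + body) from an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        header_lines = [line]
        content_length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header_lines.append(line)
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())

        if content_length is None:
            raise ValueError("record has no Content-Length header")
        body = stream.read(content_length)
        if len(body) != content_length:
            raise ValueError("truncated record body")

        # Re-append the two CRLFs that terminate every WARC record.
        yield b"".join(header_lines) + body + b"\r\n\r\n"


def split_warc(source_path, output_dir):
    """Write every record of source_path to its own file; return the new paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source_path, "rb") as stream:
        for index, record in enumerate(iter_warc_records(stream)):
            target = output_dir / f"record-{index:06d}.warc"
            target.write_bytes(record)
            written.append(target)
    return written
```

Note that this splits per record rather than per URL: if a file also holds records such as `warcinfo`, each of those ends up in its own file too. The file system concern above applies here as well, so splitting only the files you actually need to inspect is probably the safer choice.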
Ah, thanks for the update @aecio, and yes, a WARC per relevant URL.
Does that mean the FILESYSTEM data format is planned to support WARC files in the future?
No, it is not planned, but it could be included.
Is there a config option for splitting downloaded files out into their own WARC files instead of writing them all into the same one?
This would allow for easier data extraction based on individual items.