VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Separate out downloaded pages into different (warc) files #148

Open DanAbbz92 opened 6 years ago

DanAbbz92 commented 6 years ago

Is there a config option to write each downloaded page to its own WARC file, instead of all pages going into the same one?

This would make it easier to extract data for individual items.

aecio commented 6 years ago

Do you mean storing the WARC record of each URL in its own file? No, there is no option for that. But you could try setting the maximum size (in bytes) of each WARC file using:

target_storage.data_format.warc.max_file_size: 262144000

Setting a small enough size would force one page per WARC file. That being said, I wouldn't recommend this, since creating that many small files may cause file system problems on large crawls.

Another option is to use the FILESYSTEM data formats. They do create one file per URL, but they don't support the WARC format yet.
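
If one record per file is only needed for extraction, a third possibility is to keep the default multi-record WARC files and split them after the crawl. Below is a minimal sketch of that idea in Python, not something ACHE provides. It assumes an uncompressed WARC stream in which every record carries a `Content-Length` header; the function names and the `record-NNNNNN.warc` naming scheme are made up for illustration.

```python
from pathlib import Path


def iter_warc_records(stream):
    """Yield the raw bytes of each record from an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        header_lines = [line]
        content_length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header_lines.append(line)
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the record header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())
        if content_length is None:
            raise ValueError("record has no Content-Length header")
        # Read the block by length, so its content is never parsed.
        block = stream.read(content_length)
        if len(block) != content_length:
            raise ValueError("truncated record block")
        yield b"".join(header_lines) + block + b"\r\n\r\n"


def split_warc(src_path, out_dir):
    """Write each record of src_path to its own file (hypothetical naming)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src_path, "rb") as src:
        for i, record in enumerate(iter_warc_records(src)):
            target = out_dir / f"record-{i:06d}.warc"
            target.write_bytes(record)
            written.append(target)
    return written
```

For gzip-compressed files, the same generator should work on a stream opened with `gzip.open(path, "rb")`. Splitting offline keeps the crawl itself writing a small number of large files, but the file system concern above still applies to the directory that receives the split records.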

DanAbbz92 commented 6 years ago

Ah, thanks for the update @aecio, and yes, a WARC per relevant URL.

Does that mean the FILESYSTEM data format is planned to support WARC files in the future?

aecio commented 6 years ago

No, it is not planned, but it could be included.