Closed nvanva closed 7 months ago
warc2text -f url,html --skip-text-extraction --classifier skip --url-filters non-existing-file -o tmp ~/hplt/one/warc/cc/CC-MAIN-2022-40/CC-MAIN-20220924151538-20220924181538-00000.warc.gz
results in all URLs being filtered out. It would probably be better to exit with an error if the specified file doesn't exist.
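To illustrate the requested behavior, here is a minimal Python sketch of a fail-fast filter loader. It is not warc2text's actual code: `load_url_filters` and `main` are hypothetical names, and the one-pattern-per-line file format is an assumption made only for this example.

```python
import os
import sys


def load_url_filters(path):
    """Return the non-empty lines of a filter file (hypothetical format:
    one pattern per line). Raise instead of silently continuing when the
    file does not exist."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"URL filter file not found: {path}")
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def main(argv):
    """Return a process exit code: 1 if the filter file is missing, else 0."""
    try:
        filters = load_url_filters(argv[1])
    except FileNotFoundError as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    print(f"loaded {len(filters)} URL filter(s)")
    return 0
```

With a check like this, a mistyped `--url-filters` path would stop the run immediately rather than producing empty output.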