I found that the current CommonCrawler implementation returns wrong results if a WARC file contains binary parts.
This happens because of bufio.Scanner. According to the Go documentation: "Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer." I think it is the third case: a binary section can presumably run for a long stretch without a newline, so a single "line" outgrows the scanner's buffer. One can check this by comparing the output of "cat -v path_to_file | grep some_word" and "cat path_to_file | grep some_word" on any WARC file with large binary sections.