ChrisCates / CommonCrawler

🕸 A simple way to extract data from Common Crawl
MIT License
33 stars 12 forks

Error during parse binary warc package #6

Closed LastPossum closed 5 years ago

LastPossum commented 5 years ago

I found that the current CommonCrawler implementation returns wrong results if a WARC file contains binary parts.

It happens because of bufio.Scanner. According to the Go documentation: "Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer." I think it's the third case: a binary section with no line break for a long stretch presumably ends up as a single token that is larger than the scanner's buffer. One can check this by comparing the output of "cat -v path_to_file | grep some_word" with "cat path_to_file | grep some_word" on any WARC file with large binary sections.

LastPossum commented 5 years ago

I'll fix it myself, if nobody minds.

ChrisCates commented 5 years ago

@LastPossum, thanks, man! Appreciate looking into those edge cases!