Open sirex opened 8 years ago
Sure, I think that sounds great!
Is there any update on this suggestion? I encountered the same problem when loading large data:
warcat/util.py", line 66, in find_file_pattern
    raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted
I am getting the same error as @ikbear while running python -m warcat list.
I was surprised that the example provided in the documentation reads everything into memory, and that there is no easy way to iterate over records without loading the whole file.
In my case, WARC files take up gigabytes of space, so I want to process them record by record without loading everything into memory.
After reading the sources, I came up with this helper function:
I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I would imagine it to look something like this:
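To make the idea of lazy iteration concrete, here is a minimal, standalone sketch. It is not Warcat's API and not the helper mentioned above: `WarcRecord`, `iter_warc_records` and `open_warc` are hypothetical names, and the code uses only the standard library. It assumes each record starts with a `WARC/x.y` version line, carries a `Content-Length` header, and uses no folded (multi-line) header values.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, NamedTuple


class WarcRecord(NamedTuple):
    """Hypothetical container for one record; not a Warcat class."""
    version: str
    headers: Dict[str, str]
    content: bytes


def iter_warc_records(stream: BinaryIO) -> Iterator[WarcRecord]:
    """Yield records one at a time from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        # Skip the blank lines that separate consecutive records.
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return  # end of stream
        version = line.strip().decode("ascii")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")
        # Read "Name: value" header lines up to the empty line.
        headers: Dict[str, str] = {}
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Only this record's content block is held in memory.
        content = stream.read(int(headers["Content-Length"]))
        yield WarcRecord(version, headers, content)


def open_warc(path: str) -> BinaryIO:
    """Open a .warc or .warc.gz file as a binary stream."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


# Usage (the path is a placeholder):
# with open_warc("example.warc.gz") as stream:
#     for record in iter_warc_records(stream):
#         print(record.headers.get("WARC-Type"), len(record.content))
```

Because this is a generator, memory use is bounded by the largest single record rather than by the file size. A very large record would still be read in full, so a real implementation might expose the content block as a file-like object instead.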
Also, if I could get lxml, BeautifulSoup and json objects from records, something like this:
Then it would be really amazing.
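As a rough illustration of what such accessors could look like, here is a sketch under stated assumptions. `Payload` and `http_body` are hypothetical names, not Warcat classes or functions. The code assumes the record is a `response` record whose content block is a plain HTTP message (no chunked transfer encoding, no compression), and it imports lxml and BeautifulSoup lazily so that only the parser actually requested needs to be installed.

```python
import json
from typing import Any


def http_body(content: bytes) -> bytes:
    """Return the bytes after the HTTP header section of a content block.

    Assumes the block is an HTTP message; if no blank line is found,
    the block is returned unchanged.
    """
    _, sep, body = content.partition(b"\r\n\r\n")
    return body if sep else content


class Payload:
    """Hypothetical wrapper around a record's content block."""

    def __init__(self, content: bytes, encoding: str = "utf-8"):
        self.body = http_body(content)
        self.encoding = encoding

    def json(self) -> Any:
        """Decode the body and parse it as JSON."""
        return json.loads(self.body.decode(self.encoding))

    def lxml(self):
        """Parse the body as HTML with lxml (third-party)."""
        import lxml.html  # imported lazily; optional dependency
        return lxml.html.fromstring(self.body)

    def soup(self):
        """Parse the body as HTML with BeautifulSoup (third-party)."""
        from bs4 import BeautifulSoup  # imported lazily; optional dependency
        return BeautifulSoup(self.body, "html.parser")


# Usage with a hand-written response block:
block = b'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{"a": [1, 2]}'
data = Payload(block).json()
```

Keeping the parsers behind methods means a record is only parsed when the caller asks for a particular representation, which fits the lazy, record-by-record processing requested above.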
If you agree with the suggested API, I can create a pull request with the implementation.