chfoo / warcat

Tool and library for handling Web ARChive (WARC) files.
GNU General Public License v3.0

Reading in an in-memory gzip.GzipFile object breaks warcat.model.binary.BinaryFileRef objects #10

Closed d-m closed 8 years ago

d-m commented 8 years ago

The following:

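import gzip
import io
import warcat.model
# r is assumed to be an HTTP response (e.g. from requests) whose body is a gzipped WARC file.
byte_stream = io.BytesIO(r.content)
file_object = gzip.GzipFile(fileobj=byte_stream)
warc = warcat.model.WARC()
warc.read_file_object(file_object)
rec = warc.records[0]
binary_block = rec.content_block.binary_block.get_file()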

results in an AttributeError in warcat.model.binary.BinaryFileRef:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-1319f0884b9c> in <module>()
----> 1 rec.content_block.binary_block.get_file()

/usr/local/lib/python3.5/site-packages/warcat/model/binary.py in get_file(self, safe, spool_size)
    128             file_obj = self.file_obj
    129 
--> 130         original_position = file_obj.tell()
    131 
    132         if self.file_offset:

AttributeError: 'NoneType' object has no attribute 'tell'

The same error also occurs with the Payload.get_file method. This seems to be because the load methods of the BinaryBlock and BlockWithPayload classes pass the file object's name, rather than the file object itself, to set_file on lines 40, 83, and 96 of warcat/model/block.py. Changing these lines to pass in the file object instead of its name seems to work.
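As a rough illustration of why passing only the name can lose the stream, the sketch below uses just the standard library to build an in-memory GzipFile and inspect its name attribute. The helper make_in_memory_gzip and the sample payload are made up for this example and are not part of warcat.

```python
import gzip
import io


def make_in_memory_gzip(payload, name=None):
    """Hypothetical helper: a read-mode GzipFile backed by BytesIO, with nothing on disk."""
    byte_stream = io.BytesIO(gzip.compress(payload))
    if name is None:
        return gzip.GzipFile(fileobj=byte_stream)
    return gzip.GzipFile(name, fileobj=byte_stream)


unnamed = make_in_memory_gzip(b"WARC/1.0\r\n")

# With no filename given and a BytesIO underneath, there is no usable name,
# so code that keeps only the name has nothing to reopen later.
print(repr(unnamed.name))

# The stream itself is still perfectly readable through the object.
print(unnamed.read())
```

If only unnamed.name is handed on, the receiver has no way to reach the data again, which is consistent with file_obj ending up as None in the traceback above.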

chfoo commented 8 years ago

I pushed a fix on the develop branch. If you can, could you verify that it is fixed? Thanks.

d-m commented 8 years ago

Thanks! This fixed things for my purposes. There is still an edge case if you define the GzipFile object with a name like so:

...
file_object = gzip.GzipFile('test', fileobj=byte_stream)
...

If you name the file, you end up with:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-f0659c79356d> in <module>()
----> 1 rec.content_block.payload.get_file().read() == warc_record.content_block.payload.get_file().read()

/usr/local/lib/python3.5/site-packages/warcat/model/binary.py in get_file(self, safe, spool_size)
    124                         gzip.GzipFile(self.filename))
    125                 else:
--> 126                     file_obj = open(self.filename, 'rb')
    127 
    128                 util.file_cache.put(self.filename, file_obj)

FileNotFoundError: [Errno 2] No such file or directory: 'test'

It looks like this can be fixed by swapping this if/else statement or by putting the in-memory file in the cache.
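One way to read the "swap the if/else" suggestion is to prefer an already-open file object over a filename. The sketch below shows that idea in isolation, using only the standard library. The function resolve_file is hypothetical and is not warcat's actual code or API; it only assumes that both a file object and a name may be available.

```python
import gzip
import io


def resolve_file(file_obj=None, filename=None):
    """Hypothetical helper (not warcat's API): prefer an open file object over a name."""
    if file_obj is not None:
        # An in-memory stream may carry a name that does not exist on disk,
        # so use the object that is already open.
        return file_obj
    if filename:
        return open(filename, 'rb')
    raise ValueError('neither a file object nor a filename was given')


# A named GzipFile that lives only in memory, as in the edge case above.
byte_stream = io.BytesIO(gzip.compress(b'example payload'))
file_object = gzip.GzipFile('test', fileobj=byte_stream)

# The name 'test' is never opened because the file object takes priority.
resolved = resolve_file(file_obj=file_object, filename=file_object.name)
```

With this ordering the name is only a fallback, so a GzipFile labelled 'test' does not trigger an attempt to open a file called 'test' on disk.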

chfoo commented 8 years ago

OK, thanks. I'm going to track that edge case as a separate issue.