theonewolf opened this issue 10 years ago
Looks like there is corrupted binary data on block boundaries; this could account for the missing data.

Relatedly, when a file's size increases we try to pull its last block and ship that data out (since the block may never be written again). In some cases we appear to be pulling the wrong block/data:
```
-------------------
Message on Channel: blizzard:485bc61a-e7f3-4a1b-9383-e420046d969b:/home/wolf/scratch
field  : file.size
new    : 4120
old    : 4083
transa : 8940
type   : metadata
-------------------
Message on Channel: blizzard:485bc61a-e7f3-4a1b-9383-e420046d969b:/home/wolf/scratch
end    : 4120
start  : 4096
transa : 8940
type   : data
write  : #!/bin/bash
sudo apt-get
```
Also note that we might need to pull more than one block, not just the final block, to see the newly valid data.

This is most visible when a write crosses a block boundary: the final block changes to a new one, so we can miss data at the end of the "previous" final block. We may need to walk back through an arbitrary number of blocks to capture data newly associated with a file by a file size update.
This code needs to change to look up all of the blocks covered by the file size update:
https://github.com/cmusatyalab/gammaray/blob/master/src/gray-inferencer/deep_inspection.c#L1435
There appear to be missing data bytes in the file update stream when running tests that track a log file (syslog, with embedded timestamps).
First reported by @hsj0660.