Open riking opened 6 years ago
I think that CDX functionality doesn't really fit well in this package, so I designed a different interface. How does this look?
type flusher interface {
	Flush() error
}
// Writer provides functionality for writing WARC files in compressed and
// uncompressed formats.
//
// To construct a Writer, call NewWriterCompressed or NewWriterRaw.
type Writer struct {
	seekW io.WriteSeeker
	w     io.Writer
	// RecordCallback will be called after each record is written to the file.
	// If a WriteSeeker was not provided, the provided positions will be
	// invalid.
	RecordCallback func(r *Record, startPos, endPos int64)
}
// NewWriterCompressed initializes a WARC Writer writing to a compressed
// stream. The first parameter should be the "backing stream" of the
// compression. The second parameter must implement interface{Flush() error},
// which should establish a "checkpoint" in the compressed stream - a place
// where decompression can be resumed partway through, so individual records
// can be retrieved from the compressed file.
//
// Seek will only be called with whence == io.SeekCurrent and offset == 0.
//
// See also CountWriter() if you need a "fake" Seek implementation.
func NewWriterCompressed(rawFile io.WriteSeeker, cmprsWriter io.Writer) (*Writer, error) {}
// NewWriterRaw initializes a WARC Writer writing to an uncompressed stream.
// If the provided Writer implements io.Seeker, the RecordCallback will be
// available. If the provided Writer implements interface{Flush() error}, it
// will be flushed after every written Record.
func NewWriterRaw(w io.Writer) (*Writer, error) {}
And a CountWriter utility for e.g. writing to a net.Conn:
type countWriter struct {
	count int64
	w     io.Writer
}
// CountWriter implements a limited version of io.Seeker around the provided
// Writer. It only supports offset == 0 and whence == io.SeekCurrent or
// io.SeekEnd, and returns the current number of written bytes in both cases.
func CountWriter(w io.Writer) io.WriteSeeker {
	// Return a pointer: Write and Seek are declared on *countWriter.
	return &countWriter{count: 0, w: w}
}
// implements io.Writer
func (c *countWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	if n > 0 {
		// count is an int64, so n needs an explicit conversion.
		c.count += int64(n)
	}
	return n, err
}
var errCountWriterNotImplemented = stdErrors.New("unsupported seek operation")
// implements io.Seeker
func (c *countWriter) Seek(offset int64, whence int) (int64, error) {
	if offset != 0 || !(whence == io.SeekCurrent || whence == io.SeekEnd) {
		return 0, errCountWriterNotImplemented
	}
	return c.count, nil
}
Update: after reading more of the gzip code, I think Flush is not sufficient; the compressed writer needs a Close / Reset instead.
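A minimal sketch of the Close / Reset idea, assuming Go's standard compress/gzip package. Each record is written as its own gzip member, so a reader can start decompressing at any recorded offset. The helpers writeMembers and readMember are hypothetical names used only for illustration and are not part of the proposed API.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// writeMembers compresses each record as a separate gzip member and returns
// the compressed bytes plus the [start, end) offsets of every member.
// Hypothetical helper, not part of the proposed Writer.
func writeMembers(records [][]byte) ([]byte, [][2]int64, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	offsets := make([][2]int64, 0, len(records))
	for _, rec := range records {
		start := int64(buf.Len())
		if _, err := zw.Write(rec); err != nil {
			return nil, nil, err
		}
		// Close writes the gzip trailer, which ends this member.
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
		offsets = append(offsets, [2]int64{start, int64(buf.Len())})
		// Reset lets the same Writer begin a fresh member on the same stream.
		zw.Reset(&buf)
	}
	return buf.Bytes(), offsets, nil
}

// readMember decompresses one member given its offsets, without touching
// any earlier part of the stream.
func readMember(data []byte, off [2]int64) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data[off[0]:off[1]]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	data, offs, err := writeMembers([][]byte{[]byte("record one"), []byte("record two")})
	if err != nil {
		panic(err)
	}
	second, err := readMember(data, offs[1])
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", second)
}
```

With Flush alone, the data after the flush point still belongs to the same gzip member and has no header of its own, so gzip.NewReader could not be pointed at the middle of the stream this way.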
Thx for the update @riking, I'm hoping to take some time this weekend to sit down with your proposed interface change & understand your use case. Hopefully I'll be able to add constructive input, as this sounds like another exciting update!
The package should provide facilities to write warc.gz and CDX file pairs, and to append to already existing WARC/CDX pairs (see wpull --warc-append). Should also support uncompressed WARC files with uncompressed CDX size/offsets.
This issue is to discuss interface requirements.
Identified requirements:
WriteRecord() would go something like: write record to *writer, Flush the *writer, grab the file offsets and save into CDX