datatogether / warc

Golang WARC (Web ARChive) Library
GNU Affero General Public License v3.0

Interfaces to write warc.gz / CDX files #13

Open riking opened 6 years ago

riking commented 6 years ago

The package should provide facilities to write warc.gz and CDX file pairs, and to append to already existing WARC/CDX pairs (see wpull --warc-append). Should also support uncompressed WARC files with uncompressed CDX size/offsets.

This issue is to discuss interface requirements.

Identified requirements:

WriteRecord() would go something like: write the record to the *writer, Flush the *writer, then grab the file offsets before and after the record and save them into the CDX.
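The write, flush, record-offsets sequence described above can be sketched for the uncompressed case as follows. This is only an illustration under stated assumptions, not the package's API: `indexEntry`, `writeRecords`, and `demoOffsets` are hypothetical names, the records are plain byte slices rather than real WARC records, and a real CDX line carries more fields than an offset and a length.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// indexEntry is a hypothetical stand-in for one CDX line: where a record
// starts in the file and how many bytes it occupies.
type indexEntry struct {
	Offset, Length int64
}

// writeRecords writes each payload through a buffered writer, flushes it, and
// asks the underlying file for its position to learn where the record ended.
func writeRecords(f io.WriteSeeker, records [][]byte) ([]indexEntry, error) {
	bw := bufio.NewWriter(f)
	var entries []indexEntry
	for _, rec := range records {
		// Nothing is buffered at this point, so this is the record's start.
		start, err := f.Seek(0, io.SeekCurrent)
		if err != nil {
			return nil, err
		}
		if _, err := bw.Write(rec); err != nil {
			return nil, err
		}
		// Without the Flush, the file position would lag behind the data.
		if err := bw.Flush(); err != nil {
			return nil, err
		}
		end, err := f.Seek(0, io.SeekCurrent)
		if err != nil {
			return nil, err
		}
		entries = append(entries, indexEntry{Offset: start, Length: end - start})
	}
	return entries, nil
}

// demoOffsets runs writeRecords against a temporary file.
func demoOffsets() ([]indexEntry, error) {
	f, err := os.CreateTemp("", "sketch-*.warc")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	return writeRecords(f, [][]byte{[]byte("first record\n"), []byte("second\n")})
}

func main() {
	entries, err := demoOffsets()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("offset=%d length=%d\n", e.Offset, e.Length)
	}
}
```

Only `Seek(0, io.SeekCurrent)` is used to read the position, so any writer that can report how many bytes it has written would serve in place of a real file.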

riking commented 6 years ago

I think that CDX functionality doesn't really fit well in this package, so I designed a different interface. How does this look?

type flusher interface {
    Flush() error
}

// Writer provides functionality for writing WARC files in compressed and
// uncompressed formats.
//
// To construct a Writer, call NewWriterCompressed or NewWriterRaw.
type Writer struct {
    seekW io.WriteSeeker
    w     io.Writer

    // RecordCallback will be called after each record is written to the file.
    // If a WriteSeeker was not provided, the provided positions will be
    // invalid.
    RecordCallback func(r *Record, startPos, endPos int64)
}

// NewWriterCompressed initializes a WARC Writer writing to a compressed
// stream.  The first parameter should be the "backing stream" of the
// compression.  The second parameter must implement interface{Flush() error},
// which should establish a "checkpoint" in the compressed stream - a place
// where decompression can be resumed partway through, so individual records
// can be retrieved from the compressed file.
//
// Seek will only be called with whence == io.SeekCurrent and offset == 0.
//
// See also CountWriter() if you need a "fake" Seek implementation.
func NewWriterCompressed(rawFile io.WriteSeeker, cmprsWriter io.Writer) (*Writer, error) {}

// NewWriterRaw initializes a WARC Writer writing to an uncompressed stream.
// If the provided Writer implements io.Seeker, the RecordCallback will be
// available.  If the provided Writer implements interface{Flush() error}, it
// will be flushed after every written Record.
func NewWriterRaw(w io.Writer) (*Writer, error) {}

And a CountWriter utility for e.g. writing to a net.Conn:

type countWriter struct {
    count int64
    w     io.Writer
}

// CountWriter implements a limited version of io.Seeker around the provided
// Writer.  It only supports offset == 0 and whence == io.SeekCurrent or
// io.SeekEnd, and returns the current number of written bytes in both cases.
func CountWriter(w io.Writer) io.WriteSeeker {
    // Return a pointer: Write and Seek are declared on *countWriter.
    return &countWriter{count: 0, w: w}
}

// implements io.Writer
func (c *countWriter) Write(p []byte) (int, error) {
    n, err := c.w.Write(p)
    if n > 0 {
        // count is an int64, so the int returned by Write must be converted.
        c.count += int64(n)
    }
    return n, err
}

var errCountWriterNotImplemented = stdErrors.New("unsupported seek operation")

// implements io.Seeker
func (c *countWriter) Seek(offset int64, whence int) (int64, error) {
    if offset != 0 || !(whence == io.SeekCurrent || whence == io.SeekEnd) {
        return 0, errCountWriterNotImplemented
    }
    return c.count, nil
}

riking commented 6 years ago

Update: after reading more of the gzip code, I think Flush is not sufficient - it needs a Close / Reset.
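One way to read the Close / Reset point, sketched with the standard library's compress/gzip: closing the gzip writer after each record ends the current gzip member, and Reset starts a new one on the same output, so every record can be decompressed on its own from its starting offset. The helper names (`writeMembers`, `readMemberAt`, `demoSecondRecord`) are hypothetical, the records are plain byte slices, and this is not necessarily how the package ended up implementing it.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// writeMembers compresses each record as its own gzip member and returns the
// byte offset in dst at which each member begins.
func writeMembers(dst *bytes.Buffer, records [][]byte) ([]int64, error) {
	zw := gzip.NewWriter(dst)
	offsets := make([]int64, 0, len(records))
	for i, rec := range records {
		if i > 0 {
			zw.Reset(dst) // start a fresh member on the same output
		}
		// The gzip header is written lazily, so nothing of this member has
		// reached dst yet and dst.Len() is the member's starting offset.
		offsets = append(offsets, int64(dst.Len()))
		if _, err := zw.Write(rec); err != nil {
			return nil, err
		}
		// Close writes the gzip trailer and ends the member; Flush would
		// only emit pending data and leave the member open.
		if err := zw.Close(); err != nil {
			return nil, err
		}
	}
	return offsets, nil
}

// readMemberAt decompresses the single gzip member that starts at offset.
func readMemberAt(data []byte, offset int64) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data[offset:]))
	if err != nil {
		return nil, err
	}
	zr.Multistream(false) // stop at the end of this member
	return io.ReadAll(zr)
}

// demoSecondRecord writes two records and reads back only the second one.
func demoSecondRecord() (string, error) {
	var buf bytes.Buffer
	offsets, err := writeMembers(&buf, [][]byte{[]byte("record one"), []byte("record two")})
	if err != nil {
		return "", err
	}
	out, err := readMemberAt(buf.Bytes(), offsets[1])
	return string(out), err
}

func main() {
	got, err := demoSecondRecord()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

The offsets returned by `writeMembers` are positions in the compressed output, which is the kind of value a CDX entry for a warc.gz file would need.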

b5 commented 6 years ago

Thx for the update @riking, I'm hoping to take some time this weekend to sit down with your proposed interface change & understand your use case. Hopefully I'll be able to add constructive input, as this sounds like another exciting update!