cea-hpc / clustershell

Scalable cluster administration Python framework — Manage node sets, node groups and execute commands on cluster nodes in parallel.
https://clustershell.readthedocs.io/

non-line buffering #309

Open noyez opened 8 years ago

noyez commented 8 years ago

From version

$ git status
HEAD detached from v1.7.1

Perhaps this is an abuse of clush, but line buffering causes segments of binary output to go missing. I first ran into this while collecting logs (using tar) from several nodes with the Python API, and again while trying to do a database dump. As a simple test, the following command, which tars up a directory on a single node, also fails. (An ssh version of this command against a single node produces the correct tar file.)

    $ clush -N -w test.localhost.lan tar -cz /path_to_binary_data_on_rhost/ -f - > /tmp/clush_test.tar
    $ file /tmp/clush_test.tar
    /tmp/clush_test.tar: data

As far as I could tell (and from the comments in the code, thank you!), line buffering causes some of the data to go missing during the transfer, in EngineClient.py:EngineClient::_readlines(); see the following.

    def _readlines(self, sname):
        """Utility method to read client lines."""
        # read a chunk of data, may raise eof
        readbuf = self._read(sname)
        assert len(readbuf) > 0, "assertion failed: len(readbuf) > 0"

        # Current version implements line-buffered reads. If needed, we could
        # easily provide direct, non-buffered, data reads in the future.

        rfile = self.streams[sname]

        buf = rfile.rbuf + readbuf
        lines = buf.splitlines(True)
        rfile.rbuf = ""
        for line in lines:
            if line.endswith('\n'):
                if line.endswith('\r\n'):
                    yield line[:-2] # trim CRLF
                else:
                    # trim LF
                    yield line[:-1] # trim LF
            else:
                # keep partial line in buffer
                rfile.rbuf = line
                # breaking here
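
Independently of how partial lines are buffered, the trimming in the loop quoted above already discards information: once both LF and CRLF are stripped, a consumer that only sees the yielded lines cannot tell which terminator was present. The following is a minimal standalone sketch of that effect, not ClusterShell code. trim_terminators is a hypothetical helper that imitates only the splitting and trimming, and it works on bytes so that it runs on Python 3.

```python
def trim_terminators(data):
    """Split like the quoted loop and trim LF / CRLF from complete lines."""
    out = []
    for line in data.splitlines(True):
        if line.endswith(b"\r\n"):
            out.append(line[:-2])  # trim CRLF
        elif line.endswith(b"\n"):
            out.append(line[:-1])  # trim LF
    return out


# Two different byte streams (note the extra \r in the second one)...
unix = trim_terminators(b"\x1f\x8b\n\x00\n")
dos = trim_terminators(b"\x1f\x8b\r\n\x00\n")

# ...yield identical lines, so the original bytes cannot be reconstructed.
print(unix == dos)
```

For text output this ambiguity is harmless, but for a compressed tar stream a single missing \r is enough to corrupt the archive.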

I was able to transfer smaller binary files by altering line EngineClient.py:388 to read: rfile.rbuf = rfile.rbuf + line. The reason is that if the for-loop iterates more than once without finding a line ending, rfile.rbuf is overwritten on the second iteration. However, this workaround didn't help with larger files.
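
The overwrite described above can be reproduced outside ClusterShell. This is a minimal sketch that assumes the loop behaves as quoted. read_lines is a hypothetical stand-in for _readlines() that processes a single chunk of bytes, and the accumulate flag toggles the one-line change. A lone \r is treated as a line boundary by splitlines() but does not end with \n, so each such segment lands in the else branch.

```python
def read_lines(readbuf, accumulate=False):
    """Imitate the quoted loop on one chunk; return (lines, leftover rbuf)."""
    rbuf = b""
    lines = []
    for line in readbuf.splitlines(True):
        if line.endswith(b"\n"):
            # complete line: trim CRLF or LF, as in the quoted code
            lines.append(line[:-2] if line.endswith(b"\r\n") else line[:-1])
        elif accumulate:
            rbuf = rbuf + line  # workaround: keep every partial segment
        else:
            rbuf = line         # original: a later segment replaces the earlier one
    return lines, rbuf


data = b"ab\rcd\ref\n"
print(read_lines(data))                   # b"ab\r" is dropped
print(read_lines(data, accumulate=True))  # kept, but b"ef" was already emitted ahead of it
```

With the change, no segment is dropped in this case, yet the buffered bytes are held back behind a line that originally followed them. That reordering may be one reason the workaround did not hold up for larger files.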

I've dug into this issue as much as I can for now, but I wanted to make a note of it in case the project plans to allow non-line buffering or some kind of fixed-size blob buffering switch.

degremont commented 8 years ago

Indeed, binary support for output is not there yet, as everything is line-oriented internally so far.

There is a plan to drop this limitation in the future; it is tracked under ticket #20.