catb0t / tart

Automatically exported from code.google.com/p/tart

Implement IO Reader classes #17

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Currently the Tart I/O library only has implementations for writer classes, not 
readers.

Original issue reported on code.google.com by viri...@gmail.com on 1 Mar 2011 at 8:44

GoogleCodeExporter commented 8 years ago
Patch attached that implements readLn() and readAll().

A couple notes:
- I added an elements() getter on ArrayList...not sure if this is appropriate
- readAll on stdin is obviously not that useful, but implemented anyway 
assuming it will be transferrable to some other implementation.
- would be nice to be able to add()/insert() an array of the appropriate type 
to an ArrayList - if that makes sense, I can try to implement
- if the IO error codes had negative values, some of the extern routines could 
return negative values to indicate a specific failure type (see the sketch 
after this list)
- I added readLn() and readAll() into the tests just to show they work, but 
they won't be usable in the normal automated test runner since they require 
input from the console
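
To make the negative-error-code idea concrete, here's a rough sketch in C++ 
rather than Tart (readSome and the IOError values are made-up names, not part 
of the attached patch): a single integer return slot carries either a byte 
count or a specific failure type, depending on its sign.

#include <cstddef>
#include <cstdio>

// Hypothetical IO error codes, given negative values so they can share
// a return slot with a non-negative byte count.
enum IOError : int {
  IO_EOF         = -1,
  IO_UNAVAILABLE = -2,
  IO_ERROR       = -3,
};

// Stand-in for an extern read primitive: returns bytes read (>= 0) on
// success, or one of the negative IOError values on failure.
int readSome(std::FILE* f, unsigned char* buf, int count) {
  std::size_t n = std::fread(buf, 1, static_cast<std::size_t>(count), f);
  if (n > 0) return static_cast<int>(n);
  if (std::feof(f)) return IO_EOF;
  return IO_ERROR;
}

A caller can then branch on the sign of the result instead of needing a 
separate out-parameter or exception just to learn the failure type.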

Original comment by lst...@gmail.com on 2 Apr 2011 at 9:40

Attachments:

GoogleCodeExporter commented 8 years ago
Whoops - bug in readAll(). Updated patch.

Also: the following code seems to fail, with the message "Expected identifier, 
not LParen":

for (var i:int = 0; i < 10; i++) {

}

Is that a known issue, or am I doing something wrong?

Original comment by lst...@gmail.com on 2 Apr 2011 at 10:07

Attachments:

GoogleCodeExporter commented 8 years ago
Try it without the 'var' - it's implicit in the for statement. Also see if
removing the parens helps. (They are supposed to be optional.)

Original comment by viri...@gmail.com on 3 Apr 2011 at 6:08

GoogleCodeExporter commented 8 years ago
file as part of the unit test.

In terms of the elements() - I need to think of a better way to do efficient
transfers from one sequential container to another, without requiring you to
copy the elements one at a time. I had started to sketch out a solution in
the form of the class Memory.Buffer, but that has some fatal flaws. (Because
objects can move around, the first/last pointers in Buffer won't be
updated.) But the basic idea is that you have an object that represents a
contiguous sequence of objects in memory, and you have an object that
represents a read-only "view" of some subrange of that sequence. The object
also includes an object reference to the outer object that contains the
sequence, so that the sequence doesn't get garbage collected.

But yes, I was thinking about an "addAll" / "insertAll" method.

Original comment by viri...@gmail.com on 3 Apr 2011 at 6:16

GoogleCodeExporter commented 8 years ago
Was the first part of your comment cut off, perhaps?  I wasn't sure what you 
meant by "file as part of the unit test."

In that for loop, if I remove both the 'var' and the parens from the for loop, 
it works.

With respect to elements(), how do the Memory.arrayCopy/arrayMove routines fit 
into this?  Do they do one-at-a-time copies internally, and you're hoping to 
optimize that? Playing around, the following seems to work fine:

def addAll(ea:Array[ElementType]) {
  let index = _size;
  grow(ea.size);
  ElementType[].copyElements(self._data, index, ea, 0, ea.size);
}

though after a quick search through the lang ref docs, I couldn't find info on 
whether structs and arrays are passed by value or reference when calling 
functions.  A struct is listed as a value type, so I would guess passed by 
value, but that seems like it could be pretty expensive for large structs - 
same with large arrays.

Original comment by lst...@gmail.com on 3 Apr 2011 at 5:41

GoogleCodeExporter commented 8 years ago
Comments below.

I meant, that the unit test should read from a file rather than from the
console.

OK that's a bug.

OK, I have been working on this. I've gone ahead and added 'addAll' and
'insertAll' to ArrayList and Array; they accept any Collection as the
argument, not just arrays.

What I've done is added a new interface:

/** Interface used to provide efficient copying of elements from one
    container to another. Generally implemented by containers that store
    elements contiguously in memory. */
interface Copyable[%ElementType] {
  /** Copy the elements of this collection to a destination address. */
  def copyElements(dstAddr:Address[ElementType], srcOffset:int = 0,
                   count:int = int.maxVal);
}

The idea is to allow a source collection to copy data directly to a target
collection. The nice thing about this interface is that it doesn't expose
any addresses from the source collection, and in fact I've gotten rid of a
number of places where internal addresses were being exposed. You see, once
you have the address of something, you can start to modify it, even if the
container is otherwise immutable. Also, type Address is not
garbage-collection safe, because the thing pointed to might not be an
object. What I've been doing up to this point is marking such methods as
@Unsafe, but I'd rather not have them at all.

Array, ArrayList, and ImmutableList now support the Copyable interface. So
now when you want to create an Array from a collection, you call
Array.copyOf(collection), which internally does something like this:

    let result:Array = alloc(size);
    match collection as copyable:Copyable[ElementType] {
      copyable.copyElements(result._data, 0, size);
    } else {
      var index = 0;
      for el in collection {
        result._data[index++] = el;
      }
    }

In other words, if the collection supports Copyable, it does a direct
transfer of the elements; otherwise, it copies them one at a time.

Note that class String doesn't support Copyable, because String is supposed
to look like a sequence of characters, but internally it's a sequence of
bytes, and it would be inefficient to try and support the methods of
Collection and Copyable.

(Also, I need to think of a better name than "Copyable". What it really
means is being able to bulk copy elements from one container to another.)

Regarding your question about passing arrays: it depends on whether you mean
the Array type (declared with square braces, like []), which is a class; or
whether you are talking about a native array, which can't exist by itself, but
can only exist within some other object, either a class or struct.

Original comment by viri...@gmail.com on 3 Apr 2011 at 6:38

GoogleCodeExporter commented 8 years ago
There's no design in place for opening and closing files, which is why I didn't 
attempt to add a file into the test suite.  Is it your intention that 
StdFileStream will provide the main File implementation in the stdlib?  I was 
assuming that it was for std streams only given its name, but perhaps that's 
not the case.  

Copyable actually looks OK from a naming perspective, I think.

With regard to the initial question about elements(), does returning the 
backing array of an ArrayList result in a copy, or just a reference to it?  If 
by ref, is there any way to specify that the reference should be read-only?

BTW - any interest in inhabiting an IRC channel to discuss any of these things 
in real time?

Original comment by lst...@gmail.com on 3 Apr 2011 at 7:53

GoogleCodeExporter commented 8 years ago
Ah - I think I misread or overlooked your answer about arrays.  Sounds like 
it's by ref, but would still be interested to know if there's a way to make it 
read-only.

Original comment by lst...@gmail.com on 3 Apr 2011 at 7:55

GoogleCodeExporter commented 8 years ago
I was assuming 'std' was for 'stdio', not 'stdin/stdout' etc.

Anyway, it's all checked in now.

For read-only arrays, use ImmutableList. Right now it just wraps an array,
but more optimization is planned.

Original comment by viri...@gmail.com on 3 Apr 2011 at 9:25

GoogleCodeExporter commented 8 years ago
I think there are 2 different topics getting conflated here :)

With respect to my read-only questions, I'm interested in this new elements() 
routine in ArrayList that returns a reference to the internal backing array. A 
case in point: in StdFileStream, I'd like to incrementally add ubytes to an 
ArrayList, then pass the internal array of ubytes to the Codec decode routine.
In my patch I have this addition, but I was thinking that it would be nice if 
it were possible to enforce read-only semantics on the reference returned, such 
that the array couldn't be modified behind the back of the ArrayList.

In any event, I'll put together some open and close routines to read from an 
actual file, and submit another patch.

Original comment by lst...@gmail.com on 3 Apr 2011 at 10:12

GoogleCodeExporter commented 8 years ago
Ah, I see. Let me think about how to accomplish that.

Original comment by viri...@gmail.com on 3 Apr 2011 at 11:10

GoogleCodeExporter commented 8 years ago
OK, so there are a couple of possibilities. Let's start by stating the general
goals:

   - We want to be able to do efficient encode / decode / read / write /
   copy operations on buffers, which involves being able to operate on the
   whole buffer rather than element-at-a-time access.
   - At the same time, however, we don't want end users mucking about with
   the internals of ArrayList and the like. It makes it much harder to reason
   about the correctness of programs if users can easily bypass encapsulation
   and private/protected access limits.

In the case of Copyable, the interface is set up such that the only time you
can get a pointer to the internal buffer is when it's going to be written to
anyway. In other words, a given container pulls data from another container
by supplying a buffer pointer, and that buffer pointer is never exposed at
any other time.

Unfortunately, we can't use the same pattern for decoding and reading
because those are "push" rather than "pull" operations ("pull" being the
case where the destination object is making the call to get the data.)

I did a code search and it looks like the file streams and the codecs are
the only places currently in the code which require exposing internal
pointers, since there's no other easy way to bulk transfer data at the
moment.

Now, in the case of read lines, one sort of brute-force approach would be to
use an Array rather than an ArrayList, and handle the case when the array
gets full - that is, if you get to the end of the array, and you haven't
seen a newline yet, then decode the characters you have read so far, empty
the array, and keep reading. The problem here is what to do about incomplete
UTF-8 characters at the end of the buffer that got chopped in the middle. To
handle that case, we would need to add a method to Codec that counts the
number of whole characters in a buffer and returns the number of bytes. Once
we know how many bytes to decode, then we decode those bytes, and take the
leftover piece and copy it down to the start of the array and continue
reading at that point. Of course, that means making two passes over the
buffer, one to count bytes and one to decode them. While this could be
avoided by making a decode function that does both, and returns a tuple of
(bytes read, characters written), it gets complicated because then you get
into issues of making sure the character buffer is large enough, etc.
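
The "count the whole characters" helper doesn't exist in Codec yet, so here is 
only a sketch of the idea, in C++ rather than Tart and with an invented name 
(completeUtf8Bytes): it reports how many leading bytes of a buffer form 
complete UTF-8 sequences, so the caller knows how much is safe to decode and 
how much to carry over to the next read.

#include <cstddef>

// Expected length of a UTF-8 sequence given its lead byte, or 1 for
// bytes that cannot start a sequence (they will fail later in decode).
static std::size_t utf8SeqLen(unsigned char b) {
  if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                          // continuation or invalid lead byte
}

// Number of bytes at the start of buf that form only complete UTF-8
// sequences. Anything past that point is a partial trailing character
// that should be carried over to the next read.
std::size_t completeUtf8Bytes(const unsigned char* buf, std::size_t len) {
  // Step back over at most 3 trailing continuation bytes to find the
  // lead byte of the final character.
  std::size_t i = len;
  while (i > 0 && len - i < 3 && (buf[i - 1] & 0xC0) == 0x80) {
    --i;
  }
  if (i == 0) return len;  // nothing but continuation bytes; let decode report it
  std::size_t start = i - 1;
  std::size_t need = utf8SeqLen(buf[start]);
  // If the final sequence fits entirely in the buffer, keep everything;
  // otherwise stop just before its lead byte.
  return (start + need <= len) ? len : start;
}

Note it doesn't validate the sequences themselves - malformed bytes are simply 
passed through for the decoder to report.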

I think that eventually I want to make a stream-based encoder/decoder that
operates on an input and output buffer, and returns when either the input is
empty or the output is full, and saves its state in a small struct (in fact,
you can see a rough sketch of that struct in Codec.tart.) You then fill the
input or flush the output and tell the stream decoder to continue where it
left off.

A more radical approach is to have some sort of DataSource interface that
both IOStreams and Codecs implement. The various collection classes, such as
arrays and array lists, would have a method that tells them to get content
from a DataSource. The collection class passes its internal buffer pointer
to the DataSource, just like with Copyable, and keeps its buffer pointer
private the rest of the time.
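
As a shape for that, here's a minimal sketch (C++, and the names DataSource, 
fill, and ByteList are invented for illustration, not proposed API): the 
container hands its internal pointer to the source only for the duration of 
the call, the same way Copyable does.

#include <cstddef>
#include <vector>

// Something that can fill a caller-supplied buffer with up to `count`
// bytes and report how many it actually produced.
struct DataSource {
  virtual ~DataSource() = default;
  virtual std::size_t fill(unsigned char* dst, std::size_t count) = 0;
};

// A container-side method in the spirit of "get content from a
// DataSource": the buffer pointer is passed out only for the duration
// of the call, never stored or returned to user code.
class ByteList {
 public:
  std::size_t readFrom(DataSource& src, std::size_t maxBytes) {
    std::size_t old = data_.size();
    data_.resize(old + maxBytes);
    std::size_t got = src.fill(data_.data() + old, maxBytes);
    data_.resize(old + got);  // trim whatever the source didn't deliver
    return got;
  }
 private:
  std::vector<unsigned char> data_;
};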

Original comment by viri...@gmail.com on 4 Apr 2011 at 12:59

GoogleCodeExporter commented 8 years ago
OK, thinking more about this - 'elements' is fine for now. I'm making things
too complicated.

Original comment by viri...@gmail.com on 4 Apr 2011 at 5:32

GoogleCodeExporter commented 8 years ago
Yeah - lots of interesting thoughts :)

Is there any notion of const/read-only in tart currently?  That could 
potentially be a quite universally interesting approach...

Original comment by lst...@gmail.com on 4 Apr 2011 at 5:39

GoogleCodeExporter commented 8 years ago
Not currently - it's actually a very complex issue.

Original comment by viri...@gmail.com on 4 Apr 2011 at 5:44

GoogleCodeExporter commented 8 years ago
I can only imagine. Will stick with elements() for now, with a view to perhaps 
making it const somehow in the future.

Original comment by lst...@gmail.com on 4 Apr 2011 at 5:57

GoogleCodeExporter commented 8 years ago
Another thing I wanted to mention - at one point you compare the result of
readByte() with EOF, which isn't correct. EOF's value is 0xffffffff, which
won't fit into a byte. Worse, all byte values 0-255 are valid (not in UTF-8,
but in other character encodings.) Sentinel values like EOF only work with
character types, since the largest valid character is somewhere around
0x10ffff.
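
For comparison, this is the same trick C uses: fgetc() returns an int wide 
enough to hold every byte value plus a distinct EOF, and narrowing the result 
to a byte breaks the test. A quick sketch:

#include <cstdio>

void readBytes(std::FILE* f) {
  // Correct: keep the result in an int, so EOF (-1) can never collide
  // with a valid byte value 0..255.
  int c;
  while ((c = std::fgetc(f)) != EOF) {
    unsigned char byte = static_cast<unsigned char>(c);
    (void)byte;  // ... use the byte ...
  }

  // Broken: once the result is narrowed to a byte, a real 0xFF byte in
  // the input is indistinguishable from EOF, so the loop stops early:
  //
  //   unsigned char b;
  //   while ((b = static_cast<unsigned char>(std::fgetc(f))) !=
  //          static_cast<unsigned char>(EOF)) { ... }
}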

Original comment by viri...@gmail.com on 4 Apr 2011 at 6:45

GoogleCodeExporter commented 8 years ago
Right, makes sense - thanks.  This will be easier to test & verify once I'm 
running it on an actual file.

Original comment by lst...@gmail.com on 4 Apr 2011 at 7:45

GoogleCodeExporter commented 8 years ago
Been working on this a bit, and I have a working version including file 
reading, at long last (!).

The class organization that I have right now is essentially: IOStream --> 
InStream, OutStream --> FileInStream, FileOutStream

The idea is that in order to keep things as simple as possible, InStream and 
OutStream can do both text and binary IO. I imagine FileInStream and 
FileOutStream will rarely be instantiated directly (in fact, perhaps make this 
not possible?) and that usage would rather look like creating a File object, 
and getting the InStream from it: File("/my/path").in() -> InStream. Up for 
debate, of course.
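
For reference, a skeletal rendering of that organization (in C++ rather than 
Tart, with stubbed bodies; the method names other than in() are guesses, not 
necessarily what's in the patch):

#include <string>
#include <utility>

// The general shape: IOStream --> InStream, OutStream --> FileInStream,
// FileOutStream. Bodies are stubs; only the hierarchy is the point.
class IOStream {
 public:
  virtual ~IOStream() = default;
  virtual void close() = 0;
};

class InStream : public IOStream {
 public:
  virtual int readByte() = 0;        // binary; -1 at end of stream
  virtual std::string readLn() = 0;  // text
  virtual std::string readAll() = 0;
};

class OutStream : public IOStream {
 public:
  virtual void writeByte(unsigned char b) = 0;
  virtual void write(const std::string& s) = 0;
};

// File-backed concrete streams, rarely instantiated directly.
class FileInStream : public InStream {
 public:
  void close() override {}
  int readByte() override { return -1; }
  std::string readLn() override { return ""; }
  std::string readAll() override { return ""; }
};

class FileOutStream : public OutStream {
 public:
  void close() override {}
  void writeByte(unsigned char) override {}
  void write(const std::string&) override {}
};

// Typical use would go through a File object rather than constructing
// the streams directly, e.g. File("/my/path").in() for reading.
class File {
 public:
  explicit File(std::string path) : path_(std::move(path)) {}
  FileInStream in() const { return FileInStream(); }
  FileOutStream out() const { return FileOutStream(); }
 private:
  std::string path_;
};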

I haven't touched StdFileStream, but it could effectively be ripped out at this 
point, as it's unused as of these changes. Please don't consider this a final 
patch - there are several remnants of debugging in there still, but I just 
wanted to attach something to show the general direction. Let me know what you 
think.

Original comment by lst...@gmail.com on 25 Apr 2011 at 1:04

Attachments:

GoogleCodeExporter commented 8 years ago
OK I've finally gotten around to this.

I just checked in some changes that will make it easier to write this code
more efficiently, and without requiring the stream classes to have privileged
access to ArrayList. Basically, I changed the API for the text Codecs. The new
API returns a struct which tells you how far it got before it stopped
processing characters.

In other words, when you call encode or decode, it will process as many
characters as it can, and there are 3 things that will cause it to stop:

   1. It got to the end of the input.
   2. It ran out of space in the output buffer.
   3. It encountered an error.

The struct that is returned contains the source index, destination index,
and error code.

One important consequence of this change is that the codecs can now handle
incomplete characters. Say for example you have a fixed-length buffer of 128
bytes. You read in 128 bytes of data. Let's say that the last two bytes of
the buffer are part of a 4-byte UTF-8 character. In other words, it's a
multi-byte character that got chopped in half by the read operation.

The new codecs can handle this situation just fine - when you call decode(),
it will stop when it gets to that 'partial' character. The decode result will
show a source index of 126 rather than 128, so you know that there are 2
down to the start of the buffer, and then start the next read operation
starting at an offset of 2. This will read in the remaining 2 bytes of the
4-byte sequence, plus whatever text comes after that.
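
To make that concrete, here's a sketch of the whole loop in C++ rather than 
Tart. The CodecResult struct and decodeUtf8 function below are stand-ins for 
the new codec API (the real names and signatures may differ), and the decoder 
is deliberately minimal (no overlong or surrogate checks), just enough to show 
the carry-over of partial characters.

#include <cstdio>
#include <cstring>
#include <vector>

// Result shape described above: how far the codec got in the source and
// destination buffers, plus an error code (0 = no error).
struct CodecResult {
  std::size_t srcIndex;
  std::size_t dstIndex;
  int errorCode;  // 0 = ok, 1 = malformed input
};

// Minimal UTF-8 -> UTF-32 decoder that stops at (1) end of input,
// (2) full output, or (3) an error. A partial trailing character is
// simply left in the source for the next call.
CodecResult decodeUtf8(const unsigned char* src, std::size_t srcLen,
                       char32_t* dst, std::size_t dstLen) {
  std::size_t si = 0, di = 0;
  while (si < srcLen && di < dstLen) {
    unsigned char b = src[si];
    std::size_t need;
    char32_t cp;
    if (b < 0x80)                { need = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { need = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { need = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { need = 4; cp = b & 0x07; }
    else return {si, di, 1};        // stray continuation byte
    if (si + need > srcLen) break;  // partial trailing character
    for (std::size_t k = 1; k < need; ++k) {
      if ((src[si + k] & 0xC0) != 0x80) return {si, di, 1};
      cp = (cp << 6) | (src[si + k] & 0x3F);
    }
    dst[di++] = cp;
    si += need;
  }
  return {si, di, 0};
}

// The loop sketched above: read into a fixed 128-byte buffer, decode as
// much as possible, then copy any leftover partial character down to the
// start and continue reading at that offset.
std::vector<char32_t> readAllChars(std::FILE* f) {
  unsigned char buf[128];
  char32_t out[128];
  std::vector<char32_t> text;
  std::size_t leftover = 0;
  for (;;) {
    std::size_t n = std::fread(buf + leftover, 1, sizeof(buf) - leftover, f);
    std::size_t avail = leftover + n;
    if (avail == 0) break;
    CodecResult r = decodeUtf8(buf, avail, out, 128);
    text.insert(text.end(), out, out + r.dstIndex);
    if (r.errorCode != 0) break;    // or raise; kept simple here
    leftover = avail - r.srcIndex;  // e.g. 128 - 126 = 2 bytes left over
    std::memmove(buf, buf + r.srcIndex, leftover);
    if (n == 0) break;              // end of file, nothing new to read
  }
  return text;
}

The key lines are the memmove of the leftover bytes to the front of the buffer 
and the next read starting at that offset, exactly as described above.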

What this basically means is that you don't need to use an ArrayList to
store the bytes you read from the file, you can just use a fixed length
array, and decode the array contents whenever it gets full. This is more
efficient anyway, as it avoids having to copy all of the bytes to a new
location every time the ArrayList grows. You *can* use an ArrayList[char] to
store the decoded characters if you want. (Another approach would be to have
a linked-list of fixed-length buffers, which are all concatenated together
at the end, so that each byte only gets copied once, but that's a premature
optimization IMHO. ArrayList is good enough for now.)

One thing that we should probably do as well is to handle the case where the
input bytes are already UTF-8 encoded, in which case there's no sense in
transforming them into a character array and then re-encoding them.
Although, again, this may be premature optimization.

I don't object to your idea of having InStream and OutStream handle both
text and byte-buffer i/o. I'm curious as to why you decided to remove the
'canSeek' and other stream properties. The idea was that if you are reading
from a socket or a pipe, that kind of stream won't support seeking, so it
could be useful to have some way to test this without actually trying it and
getting an UnsupportedOperationError.

I think that the Console should not expose the concrete type of the stream -
you should only get an interface. Different platforms may have different
implementations of the Console streams.

For streams that read from a stdio FILE* (which could be a socket or
something), if you don't like the name StdFileStream, another possible name
is PosixStream - since fread() and such functions are defined in the POSIX
standard.

Also, the thing with the UpCast serialization is checked in now.

Original comment by viri...@gmail.com on 3 May 2011 at 6:11

GoogleCodeExporter commented 8 years ago
Cool - sounds good. I'll update and adjust the patch to reflect those changes. 
I'll probably not tackle the pre-encoded input case for now, and perhaps just 
stick with ArrayList for decoding, and make a few TODO comments in there.

With respect to the properties (canSeek etc), I think I just set those aside 
during development and hadn't reintegrated them. Will definitely add those back.

Original comment by lst...@gmail.com on 3 May 2011 at 6:43