bzaar / DawgSharp

DAWG String Dictionary in C#
http://www.nuget.org/packages/DawgSharp/
GNU General Public License v3.0

EndOfStreamException: Unable to read beyond the end of the stream. #28

Closed JoshuaCode closed 4 years ago

JoshuaCode commented 4 years ago

I am trying to store a List<string> in the value field. I am using the SaveTo overload, which accepts a custom action to serialize the type. The file is written successfully. When I try to read the file back using the corresponding overload of the Load method, I get this exception:

System.IO.EndOfStreamException: 'Unable to read beyond the end of the stream.'

I know the List payload is successfully deserialized in the Func as I can break before returning and see the List containing the expected data.

I am able to reproduce the exception in Visual Studio 16.6.0 Preview 2 and LinqPad 6 on a Windows 10 machine.

private static void Main(string[] args)
{
    var dawgBuilder = new DawgBuilder<List<string>>();
    dawgBuilder.Insert("test", new List<string> { "test", "hello", "world" });
    var dawg = dawgBuilder.BuildDawg();

    using (var file = File.Create(@"E:\Data\Dawg\DAWG-Test.bin"))
        dawg.SaveTo(file, new Action<BinaryWriter, List<string>>((r, payload) =>
        {
            byte[] bytes = null;
            BinaryFormatter bf = new BinaryFormatter();
            using (MemoryStream ms = new MemoryStream())
            {
                bf.Serialize(ms, payload);
                bytes = ms.ToArray();
            }
            r.Write(bytes, 0, bytes.Length);
        }));

    dawg = Dawg<List<string>>.Load(File.Open(@"E:\Data\Dawg\DAWG-Test.bin", FileMode.Open), new Func<BinaryReader, List<string>>(r =>
    {
        List<string> result = null;
        using (var ms = new MemoryStream())
        {
            r.BaseStream.CopyTo(ms);
            BinaryFormatter bf = new BinaryFormatter();
            ms.Seek(0, SeekOrigin.Begin);
            result = (List<string>)bf.Deserialize(ms);
        }
        return result;
    }));
}
bzaar commented 4 years ago

This line:

r.BaseStream.CopyTo(ms);

copies everything from the current position to the end of the file into the MemoryStream (ms), so by the time your lambda is called for the second item, there is no input left in the file stream.
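
For readers who want to see the effect in isolation, here is a minimal sketch that does not use DawgSharp at all. The class and method names (CopyToDemo, SecondReadFails) are made up for illustration, and a MemoryStream stands in for the file. It writes two records back to back, drains the stream with CopyTo while "reading" the first one, and then shows that the next read fails.

```csharp
using System;
using System.IO;
using System.Text;

static class CopyToDemo
{
    // Returns true if reading the second record throws EndOfStreamException
    // after the first read used BaseStream.CopyTo.
    public static bool SecondReadFails()
    {
        var file = new MemoryStream(); // stands in for the file on disk
        using (var w = new BinaryWriter(file, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(1); // first record
            w.Write(2); // second record
        }
        file.Position = 0;

        using (var r = new BinaryReader(file))
        {
            // "Read" the first record the way the failing lambda did.
            // CopyTo does not stop at a record boundary; it runs to the end.
            var ms = new MemoryStream();
            r.BaseStream.CopyTo(ms); // ms now holds both records (8 bytes)

            try
            {
                r.ReadInt32(); // nothing is left for the second record
                return false;
            }
            catch (EndOfStreamException)
            {
                return true;
            }
        }
    }
}
```

Calling CopyToDemo.SecondReadFails() should return true. The same thing happens to whatever the library tries to read after the first payload.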

JoshuaCode commented 4 years ago

Thank you for your response. I haven't worked much with binary serialization before, so it took me some time to arrive at a solution. The code now writes the length of the byte array for each serialized object first, then the byte array itself. On read, it gets the length of the object and reads exactly that number of bytes. The working code is below for anyone else who comes across this issue.

Your library is fantastic. I am new to Tries and DAWGs and came across them while researching how to implement autocomplete.

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using DawgSharp;

private static void Main(string[] args)
{
    var dawgBuilder = new DawgBuilder<List<string>>(); // List<string> is the value type.
                                                       // Key type is always string.
    dawgBuilder.Insert("test1", new List<string> { "test1", "hello1", "world1" });
    dawgBuilder.Insert("test2", new List<string> { "test2", "hello2", "world2" });
    dawgBuilder.Insert("test3", new List<string> { "test3", "hello3", "world3" });
    dawgBuilder.Insert("test4", new List<string> { "test4", "hello4", "world4" });
    var dawg = dawgBuilder.BuildDawg(); // Computer is working.  Please wait ...

    using (var file = File.Create(@"E:\Data\Dawg\DAWG-Test.bin"))
        dawg.SaveTo(file, new Action<BinaryWriter, List<string>>((w, payload) =>
        {
            byte[] bytes;
            var bf = new BinaryFormatter();
            using (var ms = new MemoryStream())
            {
                bf.Serialize(ms, payload);
                bytes = ms.ToArray();
            }
            w.Write(bytes.Length);           // length prefix: tells the reader where this payload ends
            w.Write(bytes, 0, bytes.Length); // the serialized payload itself
        }));

    using (var file = File.Open(@"E:\Data\Dawg\DAWG-Test.bin", FileMode.Open))
        dawg = Dawg<List<string>>.Load(file, new Func<BinaryReader, List<string>>(r =>
        {
            int objectSize = r.ReadInt32();
            // ReadBytes keeps reading until objectSize bytes have arrived (or the stream ends);
            // Read(buffer, 0, n) is allowed to return fewer bytes than requested.
            byte[] buffer = r.ReadBytes(objectSize);
            using (var ms = new MemoryStream(buffer))
            {
                var bf = new BinaryFormatter();
                return (List<string>)bf.Deserialize(ms);
            }
        }));

    foreach (var node in dawg)
    {
        foreach (var value in node.Value)
        {
            Console.WriteLine(value);
        }
    }
}
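
A side note for anyone copying this: BinaryFormatter has been obsolete since .NET 5, and Microsoft advises against it for security reasons. The length-prefix idea does not need it. Below is a minimal sketch of a payload reader/writer pair built only on BinaryWriter and BinaryReader. The class name StringListPayload is hypothetical, and the -1 marker for a null list is a defensive assumption, not something the library is known to require.

```csharp
using System.Collections.Generic;
using System.IO;

static class StringListPayload
{
    // Writes the element count, then each string. BinaryWriter.Write(string)
    // stores its own length prefix, so every record is self-delimiting.
    public static void Write(BinaryWriter writer, List<string> payload)
    {
        if (payload == null)
        {
            writer.Write(-1); // assumption: encode a missing list as count -1
            return;
        }
        writer.Write(payload.Count);
        foreach (var item in payload)
            writer.Write(item);
    }

    // Reads exactly one record and leaves the reader positioned at the next one.
    public static List<string> Read(BinaryReader reader)
    {
        int count = reader.ReadInt32();
        if (count < 0)
            return null;
        var result = new List<string>(count);
        for (int i = 0; i < count; i++)
            result.Add(reader.ReadString());
        return result;
    }
}
```

Assuming the SaveTo and Load overloads used in the code above, these should plug in as dawg.SaveTo(file, StringListPayload.Write) and Dawg<List<string>>.Load(file, StringListPayload.Read).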
bzaar commented 4 years ago

Thanks for sharing your code.

Just so you know, having unique values for TPayload (as in your example code) reduces your DAWG to a Trie, because nodes that carry different payloads cannot be merged. That might be okay, but you probably won't get the same compression ratios as I did (I have managed to get down to one byte per word).
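
To illustrate the point about unique payloads, here is a small self-contained sketch. It is not DawgSharp's algorithm, only a naive stand-in: it builds a plain trie and then counts how many nodes remain once identical subtrees (same payload, same children) are merged. The names MergeDemo and CountMergedNodes are made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

static class MergeDemo
{
    sealed class Node
    {
        // Sorted so that two equal subtrees always produce the same signature.
        public readonly SortedDictionary<char, Node> Children = new SortedDictionary<char, Node>();
        public string Payload; // null means "no word ends here"
    }

    // Builds a trie from (word, payload) pairs and returns the number of
    // distinct nodes left after merging identical subtrees.
    public static int CountMergedNodes(IEnumerable<KeyValuePair<string, string>> words)
    {
        var root = new Node();
        foreach (var word in words)
        {
            var node = root;
            foreach (char c in word.Key)
            {
                if (!node.Children.TryGetValue(c, out var next))
                    node.Children[c] = next = new Node();
                node = next;
            }
            node.Payload = word.Value;
        }

        var ids = new Dictionary<string, int>(); // subtree signature -> node id
        Visit(root, ids);
        return ids.Count;
    }

    // Returns an id that is the same for two nodes when their payloads match
    // and their children (by letter) lead to nodes with the same ids.
    static int Visit(Node node, Dictionary<string, int> ids)
    {
        var parts = node.Children.Select(kv => kv.Key + ":" + Visit(kv.Value, ids));
        string payload = node.Payload == null ? "-" : "+" + node.Payload;
        string signature = payload + "|" + string.Join(",", parts);
        if (!ids.TryGetValue(signature, out int id))
            ids[signature] = id = ids.Count;
        return id;
    }
}
```

For the four words tap, taps, top and tops, the plain trie has 8 nodes. If all four share one payload, CountMergedNodes returns 5, because the "ap(s)" and "op(s)" branches collapse into one. If each word has its own payload, it returns 8, so nothing is shared and the structure is just the trie.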

JoshuaCode commented 4 years ago

I am working on an application that needs Google-style autocomplete when searching for a person. Our current approach is to go to the database and search using beginsWith for each of the search words. Something like:

firstName like (:searchWord1 || '%') and lastName like (:searchWord2 || '%')

I would like to get rid of the database round trip and use an in-memory data structure. Based on my research (references below), a Trie or, even better, a DAWG is the recommended approach. My initial approach is to store a list in TPayload containing the name and id of each person matching the search word. Here is a simple example:

dawgBuilder.Insert("michael", new List<(string Id, string Name)>() { ("1", "Michael Abbott"), ("2", "Michael Jones"), ("3", "Michael Smith") });

This allows me to build the list of results to display on screen and have the key to look up the person with once the user selects one of the suggested results. I am sure there are better approaches. Any suggestions you have would be much appreciated.
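
For comparison, the lookup described above can be sketched without any library. The code below is only a stand-in for the Trie/DAWG: it uses a sorted array of name tokens plus binary search for the prefix match, and maps each token to the people whose name contains it. The class name PrefixIndex and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

sealed class PrefixIndex
{
    private readonly string[] keys;                            // sorted, lower-case name tokens
    private readonly List<(string Id, string Name)>[] people;  // people[i] belongs to keys[i]

    public PrefixIndex(IEnumerable<(string Id, string Name)> persons)
    {
        var byToken = new SortedDictionary<string, List<(string Id, string Name)>>(StringComparer.Ordinal);
        foreach (var person in persons)
        {
            var tokens = person.Name.ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var token in tokens)
            {
                if (!byToken.TryGetValue(token, out var list))
                    byToken[token] = list = new List<(string Id, string Name)>();
                list.Add(person);
            }
        }
        keys = byToken.Keys.ToArray();     // Keys and Values enumerate in the same sorted order
        people = byToken.Values.ToArray();
    }

    // Returns every person who has a name token starting with the prefix.
    public IEnumerable<(string Id, string Name)> Match(string prefix)
    {
        prefix = prefix.ToLowerInvariant();
        int i = Array.BinarySearch(keys, prefix, StringComparer.Ordinal);
        if (i < 0) i = ~i; // index of the first key that sorts at or after the prefix
        // With ordinal ordering, all keys sharing the prefix sit next to each other.
        for (; i < keys.Length && keys[i].StartsWith(prefix, StringComparison.Ordinal); i++)
            foreach (var person in people[i])
                yield return person;
    }
}
```

With the three people from the example, new PrefixIndex(persons).Match("mich") yields all three, and Match("sm") yields only Michael Smith. Combining several search words, as in the SQL above, would mean intersecting the results per word, which this sketch leaves out.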

References: