clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0
418 stars 101 forks source link

Resources need to be loaded faster #458

Closed kwalcock closed 3 years ago

kwalcock commented 3 years ago

We need to quit loading resources as text and parsing them. Even Java serialization can be 10 times faster. However, that serialization is rather brittle. There are lots of options to choose from. Here's an evaluation of some of them: https://github.com/eishay/jvm-serializers/wiki. Any resource that takes more than X seconds to load should be converted to use one of these serializers that we standardize on.

MihaiSurdeanu commented 3 years ago

Coifer seen me like a winner. Or protostuff

On Thu, Mar 18, 2021 at 15:45 Keith Alcock @.***> wrote:

We need to quit loading resources as text and parsing them. Even Java serialization can be 10 times faster. However, that serialization is rather brittle. There are lots of options to choose from. Here's an evaluation of some of them: https://github.com/eishay/jvm-serializers/wiki. Any resource that takes more than X seconds to load should be converted to use one of these serializers that we standardize on.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/458, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TXYYLZGRAJCCOCXGRDTEJ7B5ANCNFSM4ZNSHGDQ .

kwalcock commented 3 years ago

We're mostly concerned with read times and here they are for glove.840B.300d.10f:

format time (sec.)
text 34.76
bin 2.22
kryo 2.73
colf 3.10

Bin is for the simple Java serialization, however this is for the significantly optimized CompactWordEmbeddingMap and not some random, possibly circular object graph.

Write times aren't as important, but here they are:

format time (sec.)
bin 3.01
kryo 3.87
colf 2.76

File sizes don't differ significantly:

format file size (KB)
text 1,960,473
bin 918,192
kryo 918,192
colf 922,675

These times were all for reading from a file, not reading from a file that was embedded in a jar as a resource.

kwalcock commented 3 years ago

If only considering speed, there isn't justification to change from the standard Java serialization. I mentioned brittleness before. This may have been because the object serialization is dependent upon Java version and Scala version. We're changing Scala all the time. It seems like Java should be fairly forward compatible in that Java 11 should be able to read our Java 8 objects. Perhaps I should test that. In the simple class, it should be possible to convert all Scala objects to Java objects for serialization and not be dependent on Scala. For instance, the Option[Array[Float]] could be changed to a Java float[] or null and maybe I can check the serialized object to make sure there is nothing Scala about it.

MihaiSurdeanu commented 3 years ago

Well, converting to Java objects for serialization would slow down things even further, no? Unless we keep our data structures as Java ones all the time. I don't know, maybe using a JVM independent serialization makes sense for simplicity?

kwalcock commented 3 years ago

They are only String, Array[Float], Option[Array[Float]], and Int, so Java shouldn't be a problem. colf is language-independent, but for that reason more complex. One has to describe the structures in a different file and then translate the description via a separate executable into Java code which then gets converted into Scala in the same way as we would otherwise. For this simple structure it is not difficult. The main colfer project does not support Python, which is too bad.

MihaiSurdeanu commented 3 years ago

Ok then. Java it is!

On May 9, 2021 at 4:49:13 PM, Keith Alcock @.***) wrote:

They are only String, Array[Float], Option[Array[Float]], and Int, so Java shouldn't be a problem. colf is language-independent, but for that reason more complex. One has to describe the structures in a different file and then translate the description via a separate executable into Java code which then gets converted into Scala in the same way as we would otherwise. For this simple structure it is not difficult. The main colfer project does not support Python, which is too bad.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/458#issuecomment-835948042, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TQFHU3COQO3SE2TJALTM4NPRANCNFSM4ZNSHGDQ .

kwalcock commented 3 years ago

The one I hadn't tried yet was protobuf. It turns out that protobuf stores repeated elements, like arrays, in Java ArrayLists. ArrayLists can only store objects, not primitives. As a result, all 233,430,600 floats are getting boxed into objects and added one at a time to the list and have been for the last half an hour. It doesn't seem appropriate for the job. It calls its data structures "messages" and emphasizes compatibility across languages rather than compactness or speed. We've got a novel or tome rather than a message and only one or two languages to support. Kryo at least supports arrays directly and is better for this particular job.

kwalcock commented 3 years ago

Of the formats investigated, my recommendation for Glove serialization would be in this order:

  1. Kryo
  2. Java/Scala
  3. Colf
  4. Protobuf

The first two are a close call. We can make do with Java/Scala and not require this extra library, but Kryo has the benefits of

  1. It can be limited to Java without allowing accidental mixing in of a Scala class.
  2. It is more secure because of the class registration requirement. See code below.

It tries to hold its own against Java/Scala in that

  1. There is some Scala support (https://github.com/EsotericSoftware/kryo#scala).
  2. The speed is fairly close to Java/Scala in the Glove case.

Here's the Kryo version of loading Glove:

  def loadKryo(filename: String): BuildType = {
    val kryo = new Kryo()
    kryo.register(classOf[Array[Float]])

    new Input(new BufferedInputStream(new FileInputStream(filename))).autoClose { input =>
      val map = {
        val text = kryo.readObject(input, classOf[String])
        val map: ImplMapType = new ImplMapType()
        map ++= new SplitterIter(text)
      }
      val array = kryo.readObject(input, classOf[Array[Float]])
      val unknownArrayOpt = Option(kryo.readObjectOrNull(input, classOf[Array[Float]]))
      val columns = kryo.readObject(input, classOf[Int])

      BuildType(map, array, unknownArrayOpt, columns)
    }
  }

Here's the straight Java/Scala version

  def loadSer(filename: String): BuildType = {
    new ClassLoaderObjectInputStream(this.getClass.getClassLoader, new BufferedInputStream(new FileInputStream(filename))).autoClose { objectInputStream =>
      val map = {
        val text = objectInputStream.readObject().asInstanceOf[String]
        val map: ImplMapType = new ImplMapType()
        map ++= new SplitterIter(text)
      }
      val array = objectInputStream.readObject().asInstanceOf[Array[Float]]
      val unknownArrayOpt = Option(objectInputStream.readObject().asInstanceOf[Array[Float]])
      val columns = objectInputStream.readInt

      BuildType(map, array, unknownArrayOpt, columns)
    }
  }
BeckySharp commented 3 years ago

is one of the file types bigger than the other? also with the kryo one is the file human readable? fwiw i think @marcovzla was using kryo for odinson (the serialized dependencies) but moved to unsafe bc it’s faster (bc it’s custom, only does one thing). also i think the java unsafe was much faster but YMMV

MihaiSurdeanu commented 3 years ago
  1. I like the registration part of kryo! I would prefer this over Java/Scala serialization.
  2. I think the fact the serialized embeddings are not readable is not a big deal. The jar will contain the text ones too.
  3. Unsafe sounds intriguing. Is it hard to test it out Keith?
BeckySharp commented 3 years ago

You can look at the code for it here: https://github.com/lum-ai/odinson/blob/master/core/src/main/scala/ai/lum/odinson/serialization/UnsafeSerializer.scala I "helped" when he did this (mainly learned) and he made the point that bc it's custom, it will always be faster than a general purpose library like kryo. plus it makes you feel like you're living on the edge -- "unsafe" :)

MihaiSurdeanu commented 3 years ago

Nice. Pretty simple.

On Thu, May 13, 2021 at 7:47 AM Becky Sharp @.***> wrote:

You can look at the code for it here: https://github.com/lum-ai/odinson/blob/master/core/src/main/scala/ai/lum/odinson/serialization/UnsafeSerializer.scala

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/458#issuecomment-840611311, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TRYMJVCDQFWLUG26RLTNPQ63ANCNFSM4ZNSHGDQ .

kwalcock commented 3 years ago

Kryo is not human readable. For Glove, file size is an issue and we probably can't have both. Having a single, standard library, at least for similar enough purposes, would be nice, so I should look at unsafe.

I can also read and write the 16-bit floats. Getting them integrated, though, will be more difficult. For every * and + they need to be converted to Float, so I suspect there will be a runtime penalty.

kwalcock commented 3 years ago

File sizes are compared in a table above. The three so far are about even and the 16-bit was about half that, as expected.

kwalcock commented 3 years ago

P.S. I think the text version should be in a different jar. The extra 2GB for the text version is too much to be dragging around unused.

BeckySharp commented 3 years ago

File sizes are compared in a table above. The three so far are about even and the 16-bit was about half that, as expected.

sorry I didn't notice it, thanks for the pointer

MihaiSurdeanu commented 3 years ago

+1 on separating the text files in a separate jar.

kwalcock commented 3 years ago

In that unsafe serialization, lots of integers are being written to a buffer, either singly with unsafe.putInt, or multiply with unsafe.copyMemory. How is endianness accounted for in the latter case? It seems like java.nio.ByteOrder would need to be consulted or a ByteBuffer used (e.g., ByteBuffer buffer = buffer.order(ByteOrder.LITTLE_ENDIAN)) if the bytes are eventually to be written to a jar file and shared with other computers as will be the case with the glove file. Is odinson is serializing for use only on the local computer?

BeckySharp commented 3 years ago

@marcovzla do you know?

marcovzla commented 3 years ago

Internally, Java uses Big Endian. As far as I know, the unsafe api does not change the endianness, so the serialized files are big endian too.

kwalcock commented 3 years ago

Google may be deceiving me. It isn't yet clear to me where the swapping takes place. It might be when the byte codes are being read from class files, during de/serialization, or with each integer operation, etc. Is there a definitive source that says bytes are stored in memory in the same order? There are several places like https://stackoverflow.com/questions/9431526/find-endianness-of-system-in-java that imply that the order in memory can change, but they aren't doing a thorough job of it.

marcovzla commented 3 years ago

Unfortunately the unsafe api is not documented. It makes sense to me that internally java stores data in the native order. Also, if you go this route then you will have to write the serialization yourself. That makes sense for odinson because there is a single object that we want to serialize. If you need a more general solution I would probably go with kryo.

marcovzla commented 3 years ago

Another thing to consider is that more recent versions of java provide alternatives to the unsafe api, like foreign memory access api.

kwalcock commented 3 years ago

I'm a little concerned about phrases like this: "On JVMs that have the sun.misc.Unsafe class".

MihaiSurdeanu commented 3 years ago

So, sounds like kryo is the simplest choice?

kwalcock commented 3 years ago

It won't take very long at all to try out unsafe and find out how many of those 3 seconds it can save.

BeckySharp commented 3 years ago

wait -- we're having all this conversation about something that takes 3 seconds? What is the rest of the initialization time?

On Thu, May 13, 2021 at 1:23 PM Keith Alcock @.***> wrote:

External Email

It won't take very long at all to try out unsafe and find out how many of those 3 seconds it can save.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/458#issuecomment-840811065, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJCPCNZFJISGK3R4LKQ3X3TNQYLHANCNFSM4ZNSHGDQ .

MihaiSurdeanu commented 3 years ago

This is what happens when you're not paying attention :) The 34 seconds is for the loading of text embeddings. The binary formats all take 2-3 seconds.

BeckySharp commented 3 years ago

well then poop. use kyro I'm so sorry all :( :( :( forgive me Keith!!

On Thu, May 13, 2021 at 1:36 PM Mihai Surdeanu @.***> wrote:

External Email

This is what happens when you're not paying attention :) The 34 seconds is for the loading of text embeddings. The binary formats all take 2-3 seconds.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/458#issuecomment-840818517, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJCPCP4JV5XUCM5NSW7FEDTNQZ4HANCNFSM4ZNSHGDQ .

kwalcock commented 3 years ago

In the unsafe serialization the little bytes are coming out first, so little endian, not the big endian that Java uses elsewhere. (If I understand that all correctly.) So, although those microprocessors may be far and few between, we should be able to account for the other arrangement IMO.

kwalcock commented 3 years ago

I timed the unsafe version and it does win for load speed:

format time (sec.)
text 36.26
bin 2.28
kryo 2.75
colf 2.84
unsafe 1.80

For completeness, here are the other tables for the relatively unimportant save

format time (sec.)
bin 3.21
kryo 4.48
colf 2.90
unsafe 6.34

and sizes

format file size (KB)
text 1,960,473
bin 918,192
kryo 918,192
colf 922,675
unsafe 918,192

BTW these were the averages of four runs, but I didn't check the distribution.

kwalcock commented 3 years ago

For glove, I guess I'd still choose Kryo. Not having the library dependency would be nice, but I'm also wary of the unsafe part. The compatibility between Java distributions and processors is worrisome and for structures much more complicated than this, it's going to be a hassle to code.

MihaiSurdeanu commented 3 years ago

I agree. Kryo seems the best choice for this.

On Sat, May 15, 2021 at 00:41 Keith Alcock @.***> wrote:

For glove, I guess I'd still choose Kryo. Not having the library dependency would be nice, but I'm also wary of the unsafe part. The compatibility between Java distributions and processors is worrisome and for structures much more complicated than this, it's going to be a hassle to code.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/458#issuecomment-841616785, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TROFT7452B3H7II3PTTNYQTVANCNFSM4ZNSHGDQ .

kwalcock commented 3 years ago

Kryo has been incorporated.