IntersectMBO / plutus

The Plutus language implementation and tools
Apache License 2.0
1.57k stars 479 forks source link

Do not use lazy `ByteString`s for variable names #436

Closed effectfully closed 5 years ago

effectfully commented 5 years ago

The docs state:

A key feature of lazy ByteStrings is the means to manipulate large or unbounded streams of data without requiring the entire sequence to be resident in memory. To take advantage of this you have to write your functions in a lazy streaming style, e.g. classic pipeline composition. The default I/O chunk size is 32k, which should be good in most circumstances.

so we shouldn't use lazy ByteStrings for variables names, because names are small.

Moreover, we probably shouldn't use any pinned-memory-based data structures for name handling, because they cause memory fragmentation when there are many of them and we don't want to analyze whether we have "many" names or not, because it's error-prone.

Our best bet is probably ShortText (which is defined in terms of ShortByteString which is not pinned-memory-based). Related blog post.

vmchale commented 5 years ago

@mchakravarty is this pertinent given the use of Uniques?

Is there any evidence of the above issues occurring when performing profiling or benchmarking?

effectfully commented 5 years ago

is this pertinent given the use of Uniques?

How are uniques related?

Is there any evidence of the above issues occurring when performing profiling or benchmarking?

  1. The fact that there is no evidence at hand does not mean we can ignore a well-known problem and an advice from someone who implemented the library:

I've been wanting to switch to unpinned for ByteString for years, but it's a lot of work. It'd mean fixing up lots of other libs that use ByteString internals and it also requires a solution for the mmap'd ByteString use case (which does have solutions but it's still more work and may need some RTS extensions to be able to have an mmaped ByteArray# that unmaps when it's collected).

As a workaround I often recommend people use ShortByteString, which is part of the bytestring package and uses unpinned memory.

  1. Regardless of performance issues, there is a huge semantic issue: it's just plain nonsense to use lazy sequences of bytes for short human-readable text. At the very least it should be strict Text which would also save us some pain related to handling of variable names, like this for example:
language-plutus-core/generators/Language/PlutusCore/Generators/Internal/Entity.hs:import           Data.Text.Encoding                                      (encodeUtf8)
language-plutus-core/generators/Language/PlutusCore/Generators/Internal/Entity.hs:    Name ann (name <> BSL.fromStrict (encodeUtf8 . prettyText $ unUnique uniq)) uniq
language-plutus-core/src/Language/PlutusCore/Quote.hs:freshNameText ann = freshName ann . BSL.fromStrict . Text.encodeUtf8
effectfully commented 5 years ago

A Text value uses unpinned memory - meaning that it is controlled by the GHC garbage collector and can be moved around at will. -- string-types

So we can go with just strict Text for now and try ShortText later in a benchmark-driven way if we have time for that.

vmchale commented 5 years ago

A Text value uses unpinned memory - meaning that it is controlled by the GHC garbage collector and can be moved around at will. -- string-types

That's not relevant in our case.

effectfully commented 5 years ago

That's not relevant in our case.

Why not? Pinned memory is leaky, unpinned memory is not. Seems relevant to me.