Open RagnarGrootKoerkamp opened 1 month ago
inline
Hey,
I'm in the process of creating a wrapper crate for Rust bindings (https://github.com/RagnarGrootKoerkamp/libsais-rs/blob/master/src/lib.rs), so that we can easily use
libsais
for bioinformatics applications. First tests seem to give a very nice 2x speedup overlibdivsufsort
:)Some questions came up while looking at all the different versions of the
libsais_*
functions.1. Signedness of input text: The supported alphabets seems to be `u8`, `u16`, `i32` and `i64` (using rust terminology). Are there specific reasons the large integer alphabets are signed? For bioinformatics, we typically want to not waste any space/bits, and so often look at 'k-mers' of 16 or 32 DNA bases, that exactly fill a `u32` or `u64`. But it seems the library crashes when I try to use arbitrary (negative) `i32`/`i64` values. Said differently: could you also support `u32` and `u64` alphabets easily?
Unfortunately, this isn't straightforward because libsais utilizes the sign bit to mark specific types of suffixes. This is precisely why the i32/i64 version requires mutable input strings.
- Is there a difference between the
libsais_int
andlibsais16_int
functions? (And similarly forlibsais64_long
andlibsais16x64_long
) libsais and libsais16 are independent versions, though they share some identical functions as implementation. This includes the mentionedlibsais_int
andlibsais16_int
functions.- The
_int
/_long
versions require the alphabet sizek
to be passed in. Is this the number of distinct characters, or a (strict?) upper bound on the maximum value? If any non-negativei32
value is allowed, what should I use fork
? k represents the upper bound for the input alphabet, ideally set as the maximum value plus one. For u8 inputs, this is effectively 256. If your values could be any non-negative i32, I recommend compacting the alphabet by renumbering the values.- Generally speaking, how is the alphabet size used? Is it bad if I always pass in
i32::MAX
ori64::MAX
? Or is it bad if there are values (strictly?) larger than the alphabet size? Edit: after some more experimenting, it seems thatO(alphabet_size)
memory is allocated, which unfortunately causes problems for (very sparse)i64
alphabets. Is there an easy fix in the library? Otherwise, probably the easiest is for us to first re-map our values to0..k
, although that would require sorting the kmers which is slow in itself. libsais requires space for bucket counting and, to minimize allocations, it uses the free space in the 'SA' array. Since you are using the i32/i64 version of the library directly, it is recommended to allocate additional space in the SA array and pass the extra number of elements as the 'fs' (free space) parameter. A recommended value for 'fs' is 6 * k.- I suppose another minor thing is that only some functions come in a variant that takes the reusable
ctx
context parameter, but it's not clear to me how much impact this would have anyway, especially when for my case I expect 1 invocation on many GB of data, rather than many small invocations. 'ctx' is only relevant if you need to perform very frequent computations on smaller inputs (<64KB). In your case, you can ignore these variants.- Edit: I also seem to have some issue that the
_omp
function are not actually using multiple threads (for~100MB
input). Not sure what's going on there. This could be caused by a large value of k, which forces libsais to fall back to sequential algorithms, or it might be due to the Rust bindings. Additionally, libsais is more memory-bound than compute-bound. Therefore, adding more threads beyond twice the number of memory channels is unlikely to improve performance. What can boost performance even for single-threaded algorithm is the use of large pages, so I recommend enabling them.
I'm writing a separate comment because it wasn't clear what you meant by kmers as inputs. If you are trying to combine multiple sequential ACGT values into a single i32 value, that could hurt performance. Using ACGT as i8 inputs would be more efficient.
Hey,
I'm in the process of creating a wrapper crate for Rust bindings (https://github.com/RagnarGrootKoerkamp/libsais-rs/blob/master/src/lib.rs), so that we can easily use
libsais
for bioinformatics applications. First tests seem to give a very nice 2x speedup overlibdivsufsort
:)Some questions came up while looking at all the different versions of the
libsais_*
functions.Signedness of input text: The supported alphabets seems to be
u8
,u16
,i32
andi64
(using rust terminology). Are there specific reasons the large integer alphabets are signed? For bioinformatics, we typically want to not waste any space/bits, and so often look at 'k-mers' of 16 or 32 DNA bases, that exactly fill au32
oru64
. But it seems the library crashes when I try to use arbitrary (negative)i32
/i64
values.Said differently: could you also support
u32
andu64
alphabets easily?Is there a difference between the
libsais_int
andlibsais16_int
functions? (And similarly forlibsais64_long
andlibsais16x64_long
)The
_int
/_long
versions require the alphabet sizek
to be passed in. Is this the number of distinct characters, or a (strict?) upper bound on the maximum value? If any non-negativei32
value is allowed, what should I use fork
?Generally speaking, how is the alphabet size used? Is it bad if I always pass in
i32::MAX
ori64::MAX
? Or is it bad if there are values (strictly?) larger than the alphabet size?Edit: after some more experimenting, it seems that
O(alphabet_size)
memory is allocated, which unfortunately causes problems for (very sparse)i64
alphabets. Is there an easy fix in the library? Otherwise, probably the easiest is for us to first re-map our values to0..k
, although that would require sorting the kmers which is slow in itself.I suppose another minor thing is that only some functions come in a variant that takes the reusable
ctx
context parameter, but it's not clear to me how much impact this would have anyway, especially when for my case I expect 1 invocation on many GB of data, rather than many small invocations.Edit: I also seem to have some issue that the
_omp
function are not actually using multiple threads (for~100MB
input). Not sure what's going on there.