IlyaGrebnov / libsais

libsais is a library for linear time suffix array, longest common prefix array and burrows wheeler transform construction based on induced sorting algorithm.
Apache License 2.0
185 stars 24 forks source link

Documentation questions for Rust bindings #26

Open RagnarGrootKoerkamp opened 1 month ago

RagnarGrootKoerkamp commented 1 month ago

Hey,

I'm in the process of creating a wrapper crate for Rust bindings (https://github.com/RagnarGrootKoerkamp/libsais-rs/blob/master/src/lib.rs), so that we can easily use libsais for bioinformatics applications. First tests seem to give a very nice 2x speedup over libdivsufsort :)

Some questions came up while looking at all the different versions of the libsais_* functions.

  1. Signedness of input text: The supported alphabets seems to be u8, u16, i32 and i64 (using rust terminology). Are there specific reasons the large integer alphabets are signed? For bioinformatics, we typically want to not waste any space/bits, and so often look at 'k-mers' of 16 or 32 DNA bases, that exactly fill a u32 or u64. But it seems the library crashes when I try to use arbitrary (negative) i32/i64 values.

    Said differently: could you also support u32 and u64 alphabets easily?

  2. Is there a difference between the libsais_int and libsais16_int functions? (And similarly for libsais64_long and libsais16x64_long)

  3. The _int/_long versions require the alphabet size k to be passed in. Is this the number of distinct characters, or a (strict?) upper bound on the maximum value? If any non-negative i32 value is allowed, what should I use for k?

  4. Generally speaking, how is the alphabet size used? Is it bad if I always pass in i32::MAX or i64::MAX? Or is it bad if there are values (strictly?) larger than the alphabet size?

    Edit: after some more experimenting, it seems that O(alphabet_size) memory is allocated, which unfortunately causes problems for (very sparse) i64 alphabets. Is there an easy fix in the library? Otherwise, probably the easiest is for us to first re-map our values to 0..k, although that would require sorting the kmers which is slow in itself.

  5. I suppose another minor thing is that only some functions come in a variant that takes the reusable ctx context parameter, but it's not clear to me how much impact this would have anyway, especially when for my case I expect 1 invocation on many GB of data, rather than many small invocations.

  6. Edit: I also seem to have some issue that the _omp function are not actually using multiple threads (for ~100MB input). Not sure what's going on there.

IlyaGrebnov commented 1 month ago

inline

Hey,

I'm in the process of creating a wrapper crate for Rust bindings (https://github.com/RagnarGrootKoerkamp/libsais-rs/blob/master/src/lib.rs), so that we can easily use libsais for bioinformatics applications. First tests seem to give a very nice 2x speedup over libdivsufsort :)

Some questions came up while looking at all the different versions of the libsais_* functions.

1. Signedness of input text: The supported alphabets seems to be `u8`, `u16`, `i32` and `i64` (using rust terminology). Are there specific reasons the large integer alphabets are signed? For bioinformatics, we typically want to not waste any space/bits, and so often look at 'k-mers' of 16 or 32 DNA bases, that exactly fill a `u32` or `u64`. But it seems the library crashes when I try to use arbitrary (negative) `i32`/`i64` values.
   Said differently: could you also support `u32` and `u64` alphabets easily?

Unfortunately, this isn't straightforward because libsais utilizes the sign bit to mark specific types of suffixes. This is precisely why the i32/i64 version requires mutable input strings.

  1. Is there a difference between the libsais_int and libsais16_int functions? (And similarly for libsais64_long and libsais16x64_long) libsais and libsais16 are independent versions, though they share some identical functions as implementation. This includes the mentioned libsais_int and libsais16_int functions.
  2. The _int/_long versions require the alphabet size k to be passed in. Is this the number of distinct characters, or a (strict?) upper bound on the maximum value? If any non-negative i32 value is allowed, what should I use for k? k represents the upper bound for the input alphabet, ideally set as the maximum value plus one. For u8 inputs, this is effectively 256. If your values could be any non-negative i32, I recommend compacting the alphabet by renumbering the values.
  3. Generally speaking, how is the alphabet size used? Is it bad if I always pass in i32::MAX or i64::MAX? Or is it bad if there are values (strictly?) larger than the alphabet size? Edit: after some more experimenting, it seems that O(alphabet_size) memory is allocated, which unfortunately causes problems for (very sparse) i64 alphabets. Is there an easy fix in the library? Otherwise, probably the easiest is for us to first re-map our values to 0..k, although that would require sorting the kmers which is slow in itself. libsais requires space for bucket counting and, to minimize allocations, it uses the free space in the 'SA' array. Since you are using the i32/i64 version of the library directly, it is recommended to allocate additional space in the SA array and pass the extra number of elements as the 'fs' (free space) parameter. A recommended value for 'fs' is 6 * k.
  4. I suppose another minor thing is that only some functions come in a variant that takes the reusable ctx context parameter, but it's not clear to me how much impact this would have anyway, especially when for my case I expect 1 invocation on many GB of data, rather than many small invocations. 'ctx' is only relevant if you need to perform very frequent computations on smaller inputs (<64KB). In your case, you can ignore these variants.
  5. Edit: I also seem to have some issue that the _omp function are not actually using multiple threads (for ~100MB input). Not sure what's going on there. This could be caused by a large value of k, which forces libsais to fall back to sequential algorithms, or it might be due to the Rust bindings. Additionally, libsais is more memory-bound than compute-bound. Therefore, adding more threads beyond twice the number of memory channels is unlikely to improve performance. What can boost performance even for single-threaded algorithm is the use of large pages, so I recommend enabling them.
IlyaGrebnov commented 1 month ago

I'm writing a separate comment because it wasn't clear what you meant by kmers as inputs. If you are trying to combine multiple sequential ACGT values into a single i32 value, that could hurt performance. Using ACGT as i8 inputs would be more efficient.