HudsonAlpha / fmlrc2

Apache License 2.0
43 stars 5 forks source link

fmlrc2: thread 'main' panicked at 'index out of bounds: the len is 0 but the index is ... #14

Closed mmokrejs closed 3 years ago

mmokrejs commented 3 years ago

Hi, seems the -C parameter is related to read length. But reads were 125nt in length but even with -C 100 it crashes. -C 10 is fine. But why 10-mers? Is there any use for them? I use current git master checkout.

fmlrc2 -t 16 -C 100 -k=21,59,79 comp_msbwt.npy CCS2.0000006678.fastq.gz L319_301_S9_L003.trimmomatic.tadpole.k62.shave.rinse.pairs.fasta
[2021-10-13T18:59:16Z INFO  fmlrc2] Input parameters (required):
[2021-10-13T18:59:16Z INFO  fmlrc2]     BWT: "comp_msbwt.npy"
[2021-10-13T18:59:16Z INFO  fmlrc2]     Input reads: "CCS2.0000006678.fastq.gz"
[2021-10-13T18:59:16Z INFO  fmlrc2]     Output corrected reads: "L319_301_S9_L003.trimmomatic.tadpole.k62.shave.rinse.pairs.fasta"
[2021-10-13T18:59:16Z INFO  fmlrc2] Execution Parameters:
[2021-10-13T18:59:16Z INFO  fmlrc2]     verbose: false
[2021-10-13T18:59:16Z INFO  fmlrc2]     threads: 16
[2021-10-13T18:59:16Z INFO  fmlrc2]     cache size: 100
[2021-10-13T18:59:16Z INFO  fmlrc2] Correction Parameters:
[2021-10-13T18:59:16Z INFO  fmlrc2]     reads to correct: [0, 18446744073709551615)
[2021-10-13T18:59:16Z INFO  fmlrc2]     k-mer sizes: [21, 59]
[2021-10-13T18:59:16Z INFO  fmlrc2]     abs. mininimum count: 5
[2021-10-13T18:59:16Z INFO  fmlrc2]     dyn. minimimum fraction: 0.1
[2021-10-13T18:59:16Z INFO  fmlrc2]     branching factor: 4
[2021-10-13T18:59:16Z INFO  fmlrc::bv_bwt] Loading BWT with 1 compressed values
[2021-10-13T18:59:16Z INFO  fmlrc::bv_bwt] Loaded BWT with symbol counts: [1, 0, 0, 0, 0, 0]
[2021-10-13T18:59:16Z INFO  fmlrc::bv_bwt] Allocating binary vectors...
[2021-10-13T18:59:16Z INFO  fmlrc::bv_bwt] Calculating binary vectors...
[2021-10-13T18:59:16Z INFO  fmlrc::bv_bwt] Constructing FM-indices...
[2021-10-13T18:59:16Z INFO  fmlrc::bv_bwt] Building 100-mer cache...
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 3689348814741910323', src/bv_bwt.rs:400:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
holtjma commented 3 years ago

Hello,

The -C parameter is not actually related to your read length at all, but it instead controls the length of pre-computed k-mers from the BWT. If you set it to 10, it will precompute the locations of all 10-mers. This is approximately 6^10 different k-mers and generally takes about 1GB of memory, it also does this relatively quickly in terms of run-time. Setting it lower will use less memory but have longer downstream query times, setting it higher will use more memory but save query time downstream.

For example, setting it to 11-mers, would use about 6GB and probably ~1 minute of pre-compute, 12 is 36GB and pre-compute in 6 minutes, etc. It will continue to grow exponentially and is not really practical beyond a certain point (probably in the 15-20 range, but even then you need lots of memory and the pre-compute becomes quite expensive). In practice, 8 is a good default because memory and run-time are negligible for basically every use case. We typically use 10 because 1GB of memory is relatively negligible for us, and the relatively small compute time is worth it for downstream acceleration.

In most use cases, I would recommend you use C<=10. Does this answer your question? Is there some other parameter you're trying to tune?

mmokrejs commented 3 years ago

I thought to understand from the README.md that larger is better, so I asked for 400GB of RAM. But it seems there is no advantage to derive the larger k-mers from the index? Would you mind improving the README.md to make it clearer why I should not even go for -C 100? Why not? I don't mind longer build time if the results are better. If tehy won't be any better then I am missing something. OK, this is a cache but still, ...

Anyway, I think the app should not crash.

holtjma commented 3 years ago

When you pass -C 100, you're asking it to pre-compute 6^100 k-mers which is a truly astronomical number and several orders of magnitude more than any existing dataset. (For example, 6^100 is approximately 10^77, this is close to the estimation of the number of atoms in the universe). The reason the app is crashing is because the computer cannot allocate that much memory, and then fails.

For your use case, I just recommend you start with C=10. Then, if you want, you can increase it to see at what point you reach an optimal practical run-time for your datasets.