JMMackenzie / BatchSBWT

GNU General Public License v2.0

How to reproduce the Blackwell experiment? #1

Open · karel-brinda opened this issue 1 week ago

karel-brinda commented 1 week ago

Hello,

I've enjoyed reading the paper "Batched k-mer lookup on the Spectral Burrows-Wheeler Transform".

I was very curious about how the indexing of the Blackwell dataset was done, as this is the methodologically most challenging point and probably the most critical part for interpreting the results.

However, I haven't managed to find any information in the paper beyond the statement that "The code to reproduce the experiments is available at: https://github.com/JMMackenzie/BatchSBWT/".

Would it be possible to provide a pointer to how the Blackwell index was constructed (ideally including how much RAM and disk space was necessary)? Is the resulting index available anywhere?

Thanks a lot! Karel

JMMackenzie commented 6 days ago

Hi Karel,

Thanks a lot for your interest! I think @jnalanko would be the best person to answer this, as I believe he already had the Blackwell SBWT built when I joined the project. (If you have not seen it, the SBWT paper is here: https://www.biorxiv.org/content/10.1101/2022.05.19.492613v1)

I'd be happy to re-run this if need be, though, so please keep me posted!

Cheers, Joel

jnalanko commented 6 days ago

The index was built for our Themisto paper. It's available here: https://zenodo.org/records/7736981, along with the commands to build it with Themisto. There is one color per species, not per genome, which makes the color indexing feasible. We extracted the SBWT part of the index by manually shaving off a few header bytes from the .tdbg file. If you need it, we can provide a script that shaves off those bytes correctly, producing a file that can be loaded with the C++ SBWT library.
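
For illustration, a minimal Python sketch of that shaving step; note that `HEADER_BYTES` is a placeholder, not the real Themisto offset, which would come from the script mentioned above:

```python
# Minimal sketch: strip the Themisto-specific header from a .tdbg file to
# recover a plain SBWT file. HEADER_BYTES is a PLACEHOLDER -- the real offset
# depends on Themisto's serialization format.
import sys

HEADER_BYTES = 8  # placeholder, NOT the actual Themisto header size

def shave(tdbg_path: str, sbwt_path: str, header_bytes: int = HEADER_BYTES) -> None:
    """Copy tdbg_path to sbwt_path, skipping the first header_bytes bytes."""
    with open(tdbg_path, "rb") as src, open(sbwt_path, "wb") as dst:
        src.seek(header_bytes)                 # skip past the header
        while chunk := src.read(1 << 20):      # stream the rest in 1 MiB chunks
            dst.write(chunk)

if __name__ == "__main__":
    shave(sys.argv[1], sys.argv[2])
```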

The indexing took 89 hours on a machine with 48 physical cores (96 with hyperthreading) and 1.5 TB of available RAM. I did not measure the peak RAM usage, but I would guess it was somewhere in the hundreds of gigabytes. I think we had about 5 TB of disk available on that machine, so that amount of disk should be enough. When run with 31-mers, SBWT construction dumps all k-mers to disk using 8 bytes per k-mer. This dataset had 71 billion k-mers, so the disk space needed is at the very least 71 * 8 = 568 GB. But at some point while extracting the KMC database, the usage can be roughly twice that. And then some temporary space is needed for the colors; I'm not sure how much.
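
To make the arithmetic explicit, a quick back-of-the-envelope in Python, using only the figures above (the doubling factor is the transient KMC-extraction peak; color space is excluded since it wasn't measured):

```python
# Back-of-the-envelope disk estimate for SBWT construction at k = 31.
KMERS = 71_000_000_000        # ~71 billion k-mers in the Blackwell dataset
BYTES_PER_KMER = 8            # SBWT construction dumps 8 bytes per k-mer

dump_gb = KMERS * BYTES_PER_KMER / 1e9   # minimum k-mer dump: 568 GB
peak_gb = 2 * dump_gb                    # can roughly double during KMC extraction
print(f"k-mer dump: {dump_gb:.0f} GB, transient peak: ~{peak_gb:.0f} GB (colors excluded)")
```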

If you don't need the colors, you don't need Themisto, and you can run the SBWT construction by itself using the CLI tool. To replicate what Themisto does, first preprocess the data into unitigs using ggcat, and then feed both the forward and reverse strands of each unitig to the SBWT construction to avoid an explosion of dummy nodes. (Alternatively, you could try using unitig flipper to orient the unitigs in a way that minimizes dummy nodes, in which case you could try indexing without reverse complements to get a smaller index.)
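
As a rough illustration of that strand-doubling step, here is a minimal Python sketch that writes each ggcat unitig plus its reverse complement to a new FASTA file (the file names are hypothetical examples; the SBWT CLI may also have a built-in option for reverse complements, so check its help first):

```python
# Minimal sketch: given unitigs from ggcat in FASTA format, emit each unitig
# and its reverse complement so that SBWT construction sees both strands.
# File names are hypothetical examples.
COMP = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq: str) -> str:
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMP)[::-1]

def add_reverse_strands(in_fasta: str, out_fasta: str) -> None:
    with open(in_fasta) as src, open(out_fasta, "w") as dst:
        header, chunks = None, []

        def flush() -> None:
            if header is None:
                return
            seq = "".join(chunks)
            dst.write(f"{header}\n{seq}\n")              # forward strand
            dst.write(f"{header}_rc\n{revcomp(seq)}\n")  # reverse strand

        for line in src:
            line = line.rstrip()
            if line.startswith(">"):
                flush()                  # write out the previous record
                header, chunks = line, []
            else:
                chunks.append(line)      # sequences may span multiple lines
        flush()                          # write out the final record

if __name__ == "__main__":
    add_reverse_strands("unitigs.fna", "unitigs_fw_rc.fna")
```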