caseywdunn / sharkmer


do not store reads #8

Open caseywdunn opened 10 months ago

caseywdunn commented 10 months ago

It has become infeasible to store reads. On a larger dataset, I ran out of RAM on a 2 TB machine.

The main reason I was storing reads was that I needed to know how many there are, so that I know how many to allocate to each chunk for hashing. I can't read them twice, once to count and once to hash, since they sometimes come from STDIN.

So I will take a new strategy: read in batches that are much smaller than the chunk size, and allocate them to chunks as they are read.
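A minimal sketch of the batched reading, not the actual sharkmer code. It assumes one read per line (real FASTQ parsing would group four lines per record), and `read_batch` and `batch_size` are hypothetical names. Because it works on any `BufRead`, it needs only a single pass and so also works when the input is STDIN.

```rust
use std::io::BufRead;

/// Hypothetical helper: pull up to `batch_size` reads from any buffered reader.
/// Assumes one read per line; blank lines are skipped.
/// Returns an empty Vec once the input is exhausted.
fn read_batch<R: BufRead>(reader: &mut R, batch_size: usize) -> std::io::Result<Vec<String>> {
    let mut batch = Vec::with_capacity(batch_size);
    let mut line = String::new();
    while batch.len() < batch_size {
        line.clear();
        // read_line returns Ok(0) at end of input
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        let read = line.trim_end();
        if !read.is_empty() {
            batch.push(read.to_string());
        }
    }
    Ok(batch)
}
```

Only one batch is held in memory at a time; the caller would hash it into a chunk and drop it before requesting the next one.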

There are two ways to do this. I could split each batch and fan its reads out across chunks, or I could allocate each batch in its entirety to one chunk, put the next batch in the next chunk, and so on, wrapping around after the last chunk.
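The two options can be sketched as chunk-index calculations. This is an illustration under assumptions, not the implementation: `fan_out`, `whole_batch`, `n_chunks`, and `reads_seen` are hypothetical names, and only the chunk assignment is shown, not the hashing.

```rust
/// Option 1: fan the reads of one batch out across chunks.
/// `reads_seen` is a running counter carried across batches, so the
/// rotation continues where the previous batch left off.
/// Assumes n_chunks > 0.
fn fan_out(batch_len: usize, n_chunks: usize, reads_seen: &mut usize) -> Vec<usize> {
    let mut assignment = Vec::with_capacity(batch_len);
    for _ in 0..batch_len {
        assignment.push(*reads_seen % n_chunks);
        *reads_seen += 1;
    }
    assignment
}

/// Option 2: send a whole batch to a single chunk, then wrap around.
/// Assumes n_chunks > 0.
fn whole_batch(batch_index: usize, n_chunks: usize) -> usize {
    batch_index % n_chunks
}
```

With 3 chunks and batches of 4 reads, `fan_out` assigns the first batch to chunks `[0, 1, 2, 0]` and the second to `[1, 2, 0, 1]`, while `whole_batch` sends batches 0, 1, 2, 3 to chunks 0, 1, 2, 0. In this sketch, fanning out keeps chunk sizes within one read of each other, whereas whole-batch allocation is simpler but can leave chunks differing by up to one batch if the input ends mid-cycle.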