HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License

Dependency between seed and number of threads (cpus) #215

Open izaak-coleman opened 2 years ago

izaak-coleman commented 2 years ago

Hey all! First of all, awesome tool. It's been super useful for me so far, very easy to use.

I noticed an issue when trying to recreate an identical dataset (same input, same seed) across multiple machines.

I set the seed to 42 on both machines; one had 8 CPUs, the other 4. The output of the tool differed: `diff machine_1_R1.fastq machine_2_R1.fastq` reported differences.

I wondered whether this was because the reads were being output in a random order due to the parallelism, with the reads themselves still being identical. This was not the case; the diff below still reported differences:

    cat machine_1_R1.fastq | sort > f1.fastq
    cat machine_2_R1.fastq | sort > f2.fastq
    diff f1.fastq f2.fastq

Next, I wondered whether the reads were being output in a random order due to the parallelism, and the records differed only because the headers were different (perhaps due to a globally mutexed counter that gives each read a unique id), while the DNA was still being sampled identically. To test this, I extracted only the sequences (i.e. not the headers) from the FASTQ files and ran a diff:

    sed -n '2~4p' machine_1_R1.fastq | sort > f1.fastq  # this will give us just the sequences
    sed -n '2~4p' machine_2_R1.fastq | sort > f2.fastq
    diff f1.fastq f2.fastq

Again, this reported differences. So it seems the data is genuinely different despite the seed being 42 in both runs!
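For anyone who wants to repeat the order-independent sequence check without `sed` and `sort`, here is a minimal Python sketch of the same comparison. It assumes plain, uncompressed FASTQ with exactly four lines per record (the same assumption `sed -n '2~4p'` makes); the function names are made up for illustration and are not part of InSilicoSeq.

```python
from collections import Counter


def sequence_counts(path):
    """Count how often each sequence line occurs in a 4-line-per-record FASTQ file."""
    counts = Counter()
    with open(path) as handle:
        for index, line in enumerate(handle):
            # Line 2 of every 4-line record (0-based index 1, 5, 9, ...) is the sequence.
            if index % 4 == 1:
                counts[line.strip()] += 1
    return counts


def same_sequences(path_a, path_b):
    """True if both files contain the same multiset of sequences, in any order."""
    return sequence_counts(path_a) == sequence_counts(path_b)


# Example usage (file names taken from the commands above):
# print(same_sequences("machine_1_R1.fastq", "machine_2_R1.fastq"))
```

Comparing `Counter` objects ignores read order and headers but still notices when a sequence appears a different number of times, which is what the sorted diff checks.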

The only difference left was that on one machine I was generating data with cpu=4, and on the other with cpu=8. It turns out that when I set both to cpu=4, the files were the same: `diff machine_1_R1.fastq machine_2_R1.fastq` returned nothing.

The last thing to check was whether the cause was the differing machines rather than the differing CPU counts; perhaps, in some weird way, the randomization algorithm behaves differently between the machines. But (thank the good lord Number Forty-Two) this was not the case. I ran three runs on the same machine and compared the outputs: 8 CPUs vs 4 CPUs (run 1) vs 4 CPUs (run 2). The 8-CPU output differed from both 4-CPU outputs, and the two 4-CPU outputs were identical to one another.

I assume this is a bug and not a feature; I can't think of any reason why you'd want this. It may be unfixable, since dealing with parallelism is sometimes hard (I've been there). But I thought I'd bring it to your attention: right now, identical seed and input data do not produce identical datasets if the number of CPUs differs.
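To illustrate how this kind of dependency can arise in general, here is a toy sketch. It is not InSilicoSeq's code, and I have not checked how the tool actually seeds its workers. Both functions and their seeding schemes are hypothetical. The first derives one random stream per worker, so the split of work (and therefore the output) changes with the CPU count. The second derives one stream per fixed-size chunk of reads, so the CPU count only decides which worker handles which chunk.

```python
import random


def make_read(rng, read_len=8):
    """Draw one toy 'read' from the given random stream."""
    return "".join(rng.choice("ACGT") for _ in range(read_len))


def simulate_per_worker(seed, n_reads, cpus):
    """Hypothetical scheme: one RNG per worker, reads split evenly across workers."""
    reads = []
    for worker in range(cpus):
        # Each worker's share of the reads depends on the number of workers.
        share = n_reads // cpus + (1 if worker < n_reads % cpus else 0)
        rng = random.Random(f"{seed}-{worker}")
        reads.extend(make_read(rng) for _ in range(share))
    return sorted(reads)


def simulate_per_chunk(seed, n_reads, cpus, chunk_size=10):
    """Hypothetical alternative: one RNG per fixed-size chunk, independent of cpus."""
    starts = list(range(0, n_reads, chunk_size))
    reads = []
    for worker in range(cpus):
        # Workers take chunks round-robin; the chunks themselves never change.
        for chunk_id in range(worker, len(starts), cpus):
            size = min(chunk_size, n_reads - starts[chunk_id])
            rng = random.Random(f"{seed}-{chunk_id}")
            reads.extend(make_read(rng) for _ in range(size))
    return sorted(reads)


# Per-worker streams: the set of reads changes with the worker count.
print(simulate_per_worker(42, 40, 4) == simulate_per_worker(42, 40, 8))
# Per-chunk streams: the set of reads is the same for any worker count.
print(simulate_per_chunk(42, 40, 4) == simulate_per_chunk(42, 40, 8))
```

The loops run sequentially here for simplicity; the point is only where the seed is attached. With real parallelism the output order could still vary, so the comparison is made on sorted reads.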

HadrienG commented 11 months ago

Hi!

Thanks for bringing this to my attention. I'm not sure how to go about fixing it, but perhaps in the future.

/Hadrien