MarineAGLAVE closed this PR 5 years ago
@MarineAGLAVE Thanks again for this new PR.
Why don't you directly use the kmer_counts file as the --random-source for the shuf command?
Because the --random-source argument can't read .gz files.
I wanted to copy it to the temp folder, but I failed to implement it correctly (I couldn't pass a variable to --random-source).
@MarineAGLAVE Could you just md5sum the gz file and use the hash as the seed?
@jaudoux I tried with md5sum, but it doesn't work. I found out why here: https://unix.stackexchange.com/questions/496788/does-the-size-of-the-random-source-file-matter The file given to --random-source must contain at least half as many characters as the file to shuffle, and md5sum doesn't produce enough characters. Instead, I propose this solution (new commit), which requires 1 decompression, 1 write and 2 reads of the file, rather than 2 decompressions, 1 write and 2 reads. It should be faster.
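The "decompress once" idea can be sketched as follows (file names are illustrative, and this is a simplification of the actual commit): the compressed k-mer table is decompressed a single time, and the plain-text copy then serves both as the data to shuffle and as the byte source for shuf's PRNG, which makes the shuffle deterministic.

```shell
set -euo pipefail

# Illustrative stand-in for the compressed k-mer counts table.
tmpdir=$(mktemp -d)
printf 'kmer1\t3\nkmer2\t7\nkmer3\t1\n' | gzip > "$tmpdir/kmer_counts.tsv.gz"

# 1 decompression + 1 write: shuf's --random-source cannot read gzip data,
# so the table is written out in plain text once.
gunzip -c "$tmpdir/kmer_counts.tsv.gz" > "$tmpdir/kmer_counts.tsv"

# 2 reads: the same file is both the shuffle input and the random source,
# so identical inputs always produce the identical shuffled order.
shuf --random-source="$tmpdir/kmer_counts.tsv" "$tmpdir/kmer_counts.tsv" > "$tmpdir/run1.tsv"
shuf --random-source="$tmpdir/kmer_counts.tsv" "$tmpdir/kmer_counts.tsv" > "$tmpdir/run2.tsv"

cmp "$tmpdir/run1.tsv" "$tmpdir/run2.tsv" && echo "deterministic"
rm -r "$tmpdir"
```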
Hi @MarineAGLAVE,
Thanks for this updated version of your PR. As this adds quite an overhead to the analysis (in terms of disk and CPU), I would suggest making this an option of Dekupl (deactivated by default). The README should be updated to clearly state that the pipeline is not deterministic unless this option is activated, and to explain the overhead it adds.
Do you feel like making these modifications?
Best, Jérome.
@jaudoux, I made the change to expose the seed as an option. Its default value is 'fixed', but we can choose "not-fixed" via the config file if we want to keep the variability.
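If the option is exposed through the config file, the entry might look like this (the key name below is hypothetical; the 'fixed'/'not-fixed' values are the ones described above, but check the merged commit for the actual key):

```yaml
# Hypothetical Dekupl-run config entry (key name is an assumption):
seed: "fixed"        # default: deterministic shuffle before DESeq2
# seed: "not-fixed"  # restore the previous non-deterministic behaviour
```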
Ok great, I merge ;)
I had some reproducibility problems with DESeq2 on k-mers. The origin was the shuffle seed applied to the k-mer table before DESeq2. I propose a solution to fix the seed. It is based on the number of bytes of the raw-counts.tsv table, to preserve as much randomness as possible, but if we run exactly the same dekupl-run twice we will now get the same results.
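For reference, the GNU coreutils manual documents a recipe for expanding a numeric seed, such as the byte count of raw-counts.tsv, into a repeatable byte stream that shuf can consume. This is a sketch of the idea, not necessarily the exact implementation in this PR; it assumes bash (for process substitution), GNU stat, and openssl are available:

```shell
# Illustrative stand-in for the real counts table.
printf 'kmerA\t5\nkmerB\t2\nkmerC\t9\n' > raw-counts.tsv

# Derive the seed from the table's size in bytes (GNU stat syntax).
seed=$(stat -c %s raw-counts.tsv)

# Recipe from the GNU coreutils manual: encrypt an endless stream of
# zeros with a seed-derived key to get a repeatable pseudo-random stream.
get_seeded_random() {
  openssl enc -aes-256-ctr -pass pass:"$1" -nosalt </dev/zero 2>/dev/null
}

# Two runs with the same seed yield the same permutation.
shuf --random-source=<(get_seeded_random "$seed") raw-counts.tsv > run1.tsv
shuf --random-source=<(get_seeded_random "$seed") raw-counts.tsv > run2.tsv
cmp run1.tsv run2.tsv && echo "reproducible"
```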