MarineAGLAVE closed this PR 5 years ago
@MarineAGLAVE Thanks again for this new PR.
Why don't you directly use the kmer_counts file as the --random-source for the shuf command?
Because the --random-source argument can't read .gz files.
I wanted to copy it to the temp folder, but I failed to implement it correctly (I couldn't pass a variable to --random-source).
@MarineAGLAVE Could you just md5sum the gz file and use the hash as the seed?
@jaudoux I tried with md5sum, but it doesn't work. I found out why here: https://unix.stackexchange.com/questions/496788/does-the-size-of-the-random-source-file-matter The file given to --random-source must contain at least half as many characters as the file to shuffle, and md5sum doesn't produce enough characters. Instead, I propose this solution (new commit), which requires 1 decompression, 1 write and 2 reads of the file, rather than 2 decompressions, 1 write and 2 reads. It should be faster.
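The "decompress once" idea can be sketched as follows (file names are illustrative, and this is a simplification of the actual commit): the compressed k-mer table is decompressed a single time, and the plain-text copy then serves both as the data to shuffle and as the byte source for shuf's PRNG, which makes the shuffle deterministic.

```shell
set -euo pipefail

# Illustrative stand-in for the compressed k-mer counts table.
tmpdir=$(mktemp -d)
printf 'kmer1\t3\nkmer2\t7\nkmer3\t1\n' | gzip > "$tmpdir/kmer_counts.tsv.gz"

# 1 decompression + 1 write: shuf's --random-source cannot read gzip data,
# so the table is written out in plain text once.
gunzip -c "$tmpdir/kmer_counts.tsv.gz" > "$tmpdir/kmer_counts.tsv"

# 2 reads: the same file is both the shuffle input and the random source,
# so identical inputs always produce the identical shuffled order.
shuf --random-source="$tmpdir/kmer_counts.tsv" "$tmpdir/kmer_counts.tsv" > "$tmpdir/run1.tsv"
shuf --random-source="$tmpdir/kmer_counts.tsv" "$tmpdir/kmer_counts.tsv" > "$tmpdir/run2.tsv"

cmp "$tmpdir/run1.tsv" "$tmpdir/run2.tsv" && echo "deterministic"
rm -r "$tmpdir"
```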
Hi @MarineAGLAVE,
Thanks for this updated version of your PR. As this adds quite an overhead to the analysis (in terms of disk and CPU), I would suggest making this an option of Dekupl (deactivated by default). The README should be updated to clearly state that the pipeline is not deterministic unless this option is activated, and to explain the overhead it adds.
Do you feel like making these modifications?
Best, Jérome.
@jaudoux, I made the change to expose the seed as an option. Its default value is 'fixed', but we can choose "not-fixed" via the config file if we want to keep the variability.
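If the option is exposed through the config file, the entry might look like this (the key name below is hypothetical; the 'fixed'/'not-fixed' values are the ones described above, but check the merged commit for the actual key):

```yaml
# Hypothetical Dekupl-run config entry (key name is an assumption):
seed: "fixed"        # default: deterministic shuffle before DESeq2
# seed: "not-fixed"  # restore the previous non-deterministic behaviour
```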
Ok great, I merge ;)
I had some reproducibility problems with DESeq2 on k-mers. The origin was the shuffle seed applied to the k-mer table before DESeq2. I propose a solution to fix the seed. It is based on the number of bytes of the raw-counts.tsv table, to preserve as much randomness as possible, but if we run exactly the same dekupl-run twice we will now get the same results.
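For reference, the GNU coreutils manual documents a recipe for expanding a numeric seed, such as the byte count of raw-counts.tsv, into a repeatable byte stream that shuf can consume. This is a sketch of the idea, not necessarily the exact implementation in this PR; it assumes bash (for process substitution), GNU stat, and openssl are available:

```shell
# Illustrative stand-in for the real counts table.
printf 'kmerA\t5\nkmerB\t2\nkmerC\t9\n' > raw-counts.tsv

# Derive the seed from the table's size in bytes (GNU stat syntax).
seed=$(stat -c %s raw-counts.tsv)

# Recipe from the GNU coreutils manual: encrypt an endless stream of
# zeros with a seed-derived key to get a repeatable pseudo-random stream.
get_seeded_random() {
  openssl enc -aes-256-ctr -pass pass:"$1" -nosalt </dev/zero 2>/dev/null
}

# Two runs with the same seed yield the same permutation.
shuf --random-source=<(get_seeded_random "$seed") raw-counts.tsv > run1.tsv
shuf --random-source=<(get_seeded_random "$seed") raw-counts.tsv > run2.tsv
cmp run1.tsv run2.tsv && echo "reproducible"
```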