adriantich / DnoisE

Distance denoise by Entropy
GNU General Public License v3.0
12 stars 3 forks source link

Bug; Different denoising output with 1 core vs. multiple cores #28

Closed kbseah closed 7 months ago

kbseah commented 7 months ago

The denoised sequences output by DnoisE are different when using one core (option -c 1), vs. using more than one core. The output appears to be identical for values of -c other than one that I have tested.

The number of cores used should not affect the program output, so this is a bug that affects the program's correctness.

The bug appears to have been introduced somewhere between tags v.1.2.0 and v.1.3.0:

However, the denoised sequences output by DnoisE with >=2 cores does not appear to differ between versions, except for the sort order the sequences are the same for the versions I have checked.

I have only tested the denoising of Fasta input files, but not the other usage modes of the program.

Reproducing the issue

I used the test file at test-DnoisE/sample1000.fasta. I installed each version of DnoisE to a venv, with different versions of pandas (see above) depending on the DnoisE version. Replace $DNOISE with either dnoise (v1.4.0) or python3 src/DnoisE.py (earlier versions), and run the following commands:

$DNOISE --fasta_input sample1000.fasta --fasta_output output1 -c 1
$DNOISE --fasta_input sample1000.fasta --fasta_output output2 -c 2
$DNOISE --fasta_input sample1000.fasta --fasta_output output8 -c 8

Compare the outputs

diff --brief output1_denoised_ratio_d.fasta output2_denoised_ratio_d.fasta # differ with dnoise versions including and after v.1.3.0
diff --brief output8_denoised_ratio_d.fasta output2_denoised_ratio_d.fasta # do not difer
adriantich commented 7 months ago

Hi, Thanks for reporting the bug. It seems that the output from -c 1 is the one that is wrong. I'm checking this and hopefully tomorrow it will be solved.

A.

adriantich commented 7 months ago

What I found is that in file running_dnoise.py, when de.cores == 1, in the for loop in which I create a dataframe with all the mothers in which the daughters are merged, each line needed to be transposed. As it was not, theese sequences were lost. If you check your outputs you will see that the difference are found in the first lines where in the output1 the first three sequences (those that are identified as mothers) are not there.

I have changed this for both running with and without entropy. In case this problem persists, please, report it.

expect the commit tomorrow

Thank you so much, Adrià

kbseah commented 7 months ago

thanks Adria for looking into this! looking forward to the fix. could you please also make a new release on Github so that this can be propagated to Bioconda, too?

adriantich commented 7 months ago

New release Done, thanks again for reporting the bug