[Open] luigidolcetti opened this issue 2 years ago
That certainly sounds like bad news! I don't know of any sources of randomness off the top of my head. The only thing I can think of is threading. Can you try again with `nThread=1`?
Beyond that it will be very tough for us to solve the problem without a reproducible example. Please share the data if you can, or scrub out details as much as possible if there's some privacy/proprietary concerns.
If the data is not shareable, the output of `verbose` alone would be interesting for the case where `identical` is FALSE.
Since the problem appears with `numeric`, maybe there is an issue with parsing doubles or with type bumps. It might be worth taking a look at the absolute values of the differences. But all of these things should be deterministic.
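A minimal sketch of that check, reading the same file twice and reporting which cells differ and by how much. A small synthetic file stands in for the real `fileName`; substitute the problematic file:

```r
library(data.table)

# Hedged sketch: compare two reads of the same numeric file and locate
# any cells that differ. A tiny synthetic table makes this self-contained;
# replace 'fileName' with the real file to investigate.
fileName <- tempfile(fileext = ".txt")
writeLines(c("a\tb", "5.760602\t2.414798", "1.5\t2.5"), fileName)

d1 <- fread(fileName, sep = "\t", header = TRUE, colClasses = "numeric")
d2 <- fread(fileName, sep = "\t", header = TRUE, colClasses = "numeric")

m1 <- as.matrix(d1)
m2 <- as.matrix(d2)
diff_idx <- which(m1 != m2, arr.ind = TRUE)  # row/col of every mismatch
print(diff_idx)
if (nrow(diff_idx) > 0) {
  # discrepancies of ~1 ULP show up here even when print() hides them
  print(max(abs(m1 - m2)[diff_idx]))
}
```

On an affected machine the index matrix should be non-empty and the maximum absolute difference on the order of one unit in the last place.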
thank you for your replies @MichaelChirico and @ben-schwen. Sorry, I do not feel like uploading files at the moment because they are coming from collaborators that might disagree...
Anyway, I had the chance to work a bit on these files. What happens is that, for example, in a 25000 x 24 table I get 45 'errors' that do not occur in the same cells in consecutive iterations. The errors behave this way: the character representation might be something like "5.760602", and the numeric 'visible' representation for two consecutive `fread` calls with `colClasses = 'numeric'` would be the same 5.760602, but with `dump()` one would be 5.7606020000000004 and the other 5.7606019999999996.
I also tried other similar tables exported from the same software (imaging mass cytometry), but the problem only affects seemingly random files.
@MichaelChirico, yes, `nThread = 1` seems to solve the problem. Thank you!
What apparently happens is that parsing doubles is thread-dependent and not deterministic. The reason might be that 5.760602 is not exactly representable as a 64-bit double.
| Decimal | Sign | Exponent | Mantissa |
| --- | --- | --- | --- |
| 5.7606020000000004 | 0 | 10000000001 | 0111000010101101101101000000001011010001011010111010 |
| 5.7606019999999996 | 0 | 10000000001 | 0111000010101101101101000000001011010001011010111001 |
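The table can be reproduced in base R. This is a sketch that pulls out the sign, exponent, and mantissa bits of a double using only `writeBin()` and `rawToBits()`:

```r
# Hedged sketch: decompose a 64-bit double into its IEEE 754 fields,
# reproducing the sign/exponent/mantissa table above in base R.
double_bits <- function(x) {
  bytes <- writeBin(x, raw(), size = 8, endian = "big")
  bits <- as.integer(rawToBits(bytes))  # LSB-first within each byte
  # reverse each byte's 8 bits so the 64 bits read MSB-first overall
  bits <- unlist(lapply(split(bits, rep(1:8, each = 8)), rev))
  list(sign     = bits[1],
       exponent = paste(bits[2:12], collapse = ""),
       mantissa = paste(bits[13:64], collapse = ""))
}

double_bits(5.7606020000000004)$exponent
# [1] "10000000001"

# the two decimals in the table are adjacent doubles: in the binade [4, 8)
# one unit in the last place is 2^-50, and that is exactly their difference
abs(5.7606020000000004 - 5.7606019999999996) == 2^-50
# [1] TRUE
```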
What does the standard say about which 64-bit double is "correct" in this case? I.e., if 5.760602 is exactly halfway between the two nearest representable doubles, is there any heuristic for which is preferred?
@MichaelChirico you actually fixed this in #4463. Unfortunately, this hasn't made it onto CRAN yet because Matt only pushed the OMP patch #5172 for 1.14.2.
Some code to reproduce and check that it's fixed on dev:
```r
library(data.table)
setDTthreads(2)

width <- 12
length <- 25000
numbers <- paste(
  c(paste(as.character(1:width), collapse = ','),
    rep(paste(rep('5.760602', width), collapse = ','), length)),
  collapse = '\n'
)

d1 <- fread(text = numbers, header = TRUE, verbose = TRUE)
d2 <- fread(text = numbers, header = TRUE, verbose = TRUE)
identical(d1, d2)
# [1] TRUE
# NB: this is usually TRUE, only rarely FALSE

e <- d1[[1]]
for (i in 2:length) {
  if (!identical(e[i - 1], e[i])) {
    print(i - 1)
  }
}
# Prints the following on CRAN and nothing on dev:
# [1] 12500
```
The last bit prints the positions where the parsed numbers change. I'm only about 95% sure of fread's logic for choosing the number of threads, so make sure you see it using two; the problem only shows up with at least two threads. Using more threads, or larger data that gets broken into more chunks, introduces some randomness in the output (which is why the OP noticed it at all), based on, I presume, which chunk gets read by which thread. I don't really understand why the rounding with the old lookup table would differ by thread, but, well, now it doesn't.
@luigidolcetti could you confirm that this is indeed fixed for your data with the latest development version? You should be able to install it with `update.dev.pkg()`.
@tlapak I'm unable to reproduce the `!identical` in your example with the CRAN version.
Maybe it's worth calling `_mm_getcsr` and `fegetround` at the start of every DT thread and verifying they are identical? Perhaps just dump them in verbose mode for now. Anyway, that's the only source of nondeterminism I can suggest.
I can confirm @tlapak's example on my Windows machine with 1.14.2.
I have now been able to test my example on an Ubuntu machine, where the problem does indeed not occur. So this issue seems to be Windows-specific. Did this occur on a Windows machine for you, @luigidolcetti?
@tlapak Yes, a relatively recent Windows 10.
@luigidolcetti Would you mind sharing the output of `sessionInfo()` with us? Does the issue still appear if you upgrade to 1.14.3 with `data.table::update.dev.pkg()`?
@ben-schwen, 1.14.3 works fine with @tlapak's example on a 12 x 250000 table iterated some 20 times. Here is the session info for the PC where I tried it... sorry, I cannot access the PC where I first noticed the issue at the moment.
```
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.3

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3
```
@luigidolcetti could you also try updating to 1.14.3 on the original PC and retry with the original problem?
Hi @ben-schwen, I tried on the first PC with my original dataset, and unfortunately it seems the problem persists even with version 1.14.3... here is my `sessionInfo()`:
```
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] RUNIMCTEMP_0.5.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7          rstudioapi_0.13     raster_3.4-13       BiocGenerics_0.38.0 munsell_0.5.0
 [6] colorspace_2.0-1    flowCore_2.4.0      lattice_0.20-44     R6_2.5.0            rlang_0.4.11
[11] tools_4.1.0         parallel_4.1.0      grid_4.1.0          Biobase_2.52.0      data.table_1.14.3
[16] matrixStats_0.59.0  digest_0.6.27       RcppParallel_5.1.4  randomForest_4.6-14 lifecycle_1.0.0
[21] crayon_1.4.1        cytolib_2.4.0       RProtoBufLib_2.4.0  S4Vectors_0.30.0    codetools_0.2-18
[26] ncdf4_1.17          sp_1.4-5            compiler_4.1.0      scales_1.1.1        stats4_4.1.0
```
With `nThread = 1` I get no problems, but with any other value there are differences. For example, with `nThread = 10` and a table like this:
```
'data.frame': 250000 obs. of 54 variables:
 $ Start_push     : num 0 0 0 0 0 0 0 0 0 0 ...
 $ End_push       : num 0 0 0 0 0 0 0 0 0 0 ...
 $ Pushes_duration: num 0 0 0 0 0 0 0 0 0 0 ...
 $ X              : num 0 1 2 3 4 5 6 7 8 9 ...
 $ Y              : num 0 0 0 0 0 0 0 0 0 0 ...
 $ Z              : num 0 1 2 3 4 5 6 7 8 9 ...
 $ 120Sn(Sn120Di) : num 0 1 0 0 3.64 ...
 ....
```
I get 91 discrepancies, like:

```r
sprintf("%.60f", T1[32096, 7])
# [1] "2.414797999999999777998027639114297926425933837890625000000000"
sprintf("%.60f", T2[32096, 7])
# [1] "2.414798000000000222087237489176914095878601074218750000000000"
```
I can confirm that my example with 2.414798 instead shows the differing results on 1.14.3. Bizarrely, for me, it does not on 1.14.2. Still only on Windows, not on Ubuntu.
@luigidolcetti could you test again with R 4.2 and a current build of data.table 1.14.3? I can no longer reproduce any issues since upgrading to R 4.2.
If running `update.dev.pkg()` doesn't work, you can download a compatible build here. (Note this is a build produced by our CI process; I'm just not sure `update.dev.pkg()` will download the correct version. It has never worked for me at all, tbh.) Alternatively, you can of course build it yourself, but make sure to install rtools42.
I strongly suspect there was an issue with the compiler/OpenMP in the toolchain, which has now moved from gcc 8 to gcc 10.
Hi,
probably a very simple issue to fix, but I am struggling to solve it:
I have a txt numeric table with column header.
```r
identical(
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'numeric'),
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'numeric')
)
```

returns FALSE most of the time, while

```r
identical(
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'character'),
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'character')
)
```

always returns TRUE.
On the other hand, base `read.table()` does not have this issue (but it's way slower). I would prefer to avoid loading the table as character and coercing it to numeric later (because of speed; otherwise I would have used `read.table`).
Any suggestion on how to read the same file twice and obtain identical objects (and why this is happening)?
Thank you in advance for your help, Luigi