Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Obtaining different object from same file #5346

Open luigidolcetti opened 2 years ago

luigidolcetti commented 2 years ago

Hi,

probably a very simple issue to fix, but I am struggling to solve it:

I have a txt numeric table with column header.

identical( data.table::fread(fileName,sep='\t',header = T,check.names = F,colClasses = 'numeric'), data.table::fread(fileName,sep='\t',header = T,check.names = F,colClasses = 'numeric'))

returns FALSE most of the time

while

identical( data.table::fread(fileName,sep='\t',header = T,check.names = F,colClasses = 'character'), data.table::fread(fileName,sep='\t',header = T,check.names = F,colClasses = 'character'))

always returns TRUE

On the other hand, base read.table() does not have this issue (but it's way slower). I would prefer to avoid loading the table as character and coercing it to numeric later (because of speed; otherwise I would have used read.table).

Any suggestion on how to read the same file twice and obtain identical objects (and why this is happening)?

Thank you in advance for help, Luigi

MichaelChirico commented 2 years ago

That certainly sounds like bad news! I don't know of any sources of randomness off the top of my head. The only thing I can think of is threading? Can you try again with nThread=1?

Beyond that, it will be very tough for us to solve the problem without a reproducible example. Please share the data if you can, or scrub out details as much as possible if there are privacy/proprietary concerns.

ben-schwen commented 2 years ago

If the data is not shareable, the output of verbose alone would be interesting for the case where identical is FALSE.

Since the problem appears with numeric, maybe there is an issue with parsing doubles or with type bumps. It may be worth looking at the absolute differences between the values. But all of these things should be deterministic.

luigidolcetti commented 2 years ago

thank you for your replies @MichaelChirico and @ben-schwen. Sorry, I do not feel like uploading files at the moment because they are coming from collaborators that might disagree...

Anyway, I had the chance to work a bit on these files, and what happens is that, for example, in a 25000 x 24 table I get 45 'errors' that do not fall in the same cells in consecutive iterations. These errors seem to behave this way: the character representation could be something like "5.760602", and the 'visible' numeric representation would be the same 5.760602 for two consecutive fread calls with colClasses = 'numeric'... but with dump() one would be 5.7606020000000004 and the other 5.7606019999999996.
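
For readers who want to locate such cells themselves, here is a minimal base-R sketch. The two small data frames `t1` and `t2` are hypothetical stand-ins for two fread() results (they are not the OP's data), and the approach assumes all columns are numeric.

```r
# Hypothetical stand-ins for two reads of the same file.
t1 <- data.frame(a = c(1, 5.7606020000000004), b = c(0.5, 2))
t2 <- data.frame(a = c(1, 5.7606019999999996), b = c(0.5, 2))

m1 <- as.matrix(t1)
m2 <- as.matrix(t2)

# Row/column positions where the two reads compare unequal.
diff_cells <- which(m1 != m2, arr.ind = TRUE)
print(diff_cells)

# Default printing shows only 7 significant digits, which hides the
# difference; 17 significant digits distinguish any two distinct doubles.
print(sprintf("%.17g", m1[diff_cells]))
print(sprintf("%.17g", m2[diff_cells]))
```

Counting `nrow(diff_cells)` over repeated reads gives the number of 'errors' per iteration.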

I also tried on other similar tables exported from the same software (Imaging mass cytometry), but this problem does affect only random files.

luigidolcetti commented 2 years ago

@MichaelChirico, yes nThread = 1 seems to solve the problem. Thank you
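
A minimal sketch of that workaround, assuming data.table is installed. The temporary file and its contents are made up for illustration; only the `nThread = 1` argument reflects what was tried above.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  # Made-up tab-separated file standing in for the real export.
  tmp <- tempfile(fileext = ".txt")
  writeLines(c("x\ty", rep("5.760602\t2.414798", 1000)), tmp)

  # Restrict this one call to a single thread instead of changing
  # the session-wide setting with setDTthreads().
  read_once <- function() {
    data.table::fread(tmp, sep = "\t", header = TRUE,
                      colClasses = "numeric", nThread = 1)
  }

  same <- identical(read_once(), read_once())
  print(same)
  unlink(tmp)
}
```

Single-threaded reads are slower on large files, so this trades some speed for reproducibility.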

ben-schwen commented 2 years ago

What apparently happens is that parsing doubles is dependent on the thread and not deterministic. The reason might be that 5.760602 is not exactly representable as 64-bit double.

Decimal            | Sign | Exponent    | Mantissa
5.7606020000000004 | 0    | 10000000001 | 0111000010101101101101000000001011010001011010111010
5.7606019999999996 | 0    | 10000000001 | 0111000010101101101101000000001011010001011010111001
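
The fields in this table can be extracted in base R. This is a rough sketch; `double_bits` is a hypothetical helper written for this illustration, not something data.table provides.

```r
# Split a double into its IEEE 754 sign, exponent and mantissa bit strings.
double_bits <- function(x) {
  # Big-endian byte order puts the sign and exponent bytes first.
  bytes <- writeBin(x, raw(), endian = "big")
  # rawToBits() returns the least significant bit first, so reverse each byte.
  per_byte <- vapply(bytes, function(b) {
    paste(rev(as.integer(rawToBits(b))), collapse = "")
  }, character(1))
  bits <- paste(per_byte, collapse = "")
  c(sign     = substr(bits, 1, 1),
    exponent = substr(bits, 2, 12),
    mantissa = substr(bits, 13, 64))
}

hi <- double_bits(5.7606020000000004)
lo <- double_bits(5.7606019999999996)
print(rbind(hi, lo))
```

The two mantissas differ only in their lowest bits, i.e. the two parsed values are neighbouring doubles.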

MichaelChirico commented 2 years ago

what does the standard say about which 64-bit double is "correct" in this case? i.e. 5.760602 is exactly halfway between the two nearest representable doubles, is there any heuristic for which is preferred?

tlapak commented 2 years ago

@MichaelChirico you actually fixed this in #4463. Unfortunately, this hasn't made it onto CRAN yet because Matt only pushed the OMP patch #5172 for 1.14.2.

Some code to reproduce and check that it's fixed on dev:

library('data.table')

setDTthreads(2)

width <- 12
length <- 25000

numbers <- paste(c(paste(as.character(1:width), collapse = ','), rep(paste(rep('5.760602', width), collapse =','), length)), collapse = '\n')
d1 <- fread(text=numbers, header = TRUE, verbose = TRUE)
d2 <- fread(text=numbers, header = TRUE, verbose = TRUE)
identical(d1, d2)
# [1] TRUE
# ! This is usually TRUE, only rarely FALSE

e <- d1[[1]]
for (i in 2:length) {
  if (!identical(e[i-1], e[i])) {
    print(i - 1)
  }
}
# Prints the following on CRAN and nothing on dev:
# [1] 12500

The last bit prints where the parsed numbers change. I'm only 95% sure about fread's logic for how many threads to use, so make sure you see it using two. This only shows up with at least two threads. Using more threads, or larger data that gets broken up into more chunks, introduces some randomness in the output (which is why OP even noticed it) based on, I presume, which chunk gets read by which thread. I don't really understand why the rounding would differ by thread with the old lookup table, but, well, now it doesn't.
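
The same change-point check can be written without a loop. A small sketch on a made-up vector (the vector `e` here is synthetic, not fread output):

```r
# Synthetic column: the first half parsed one way, the second half the other.
e <- c(rep(5.7606020000000004, 5), rep(5.7606019999999996, 5))

# Compare each element with its successor; the result gives the index of
# the last element before each change.
change_after <- which(e[-1] != e[-length(e)])
print(change_after)
```

Unlike identical(), `!=` yields NA when either value is NA, so this variant assumes the column has no missing values.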

@luigidolcetti could you confirm that this is indeed fixed for your data with the latest development version? You should be able to install it with update.dev.pkg()

OfekShilon commented 2 years ago

@tlapak I'm unable to reproduce the !identical in your example with the CRAN version.
Maybe it's worth calling _mm_getcsr and fegetround at the start of every DT thread and verifying that they are identical? Perhaps just dump them in verbose mode for now? Anyway, that's the only source of nondeterminism I can suggest.

ben-schwen commented 2 years ago

I can confirm @tlapak's example on my Windows machine with 1.14.2.

tlapak commented 2 years ago

I have now been able to test my example on an Ubuntu machine, where it does indeed not reproduce. So this issue seems to be Windows-specific. Did this occur on a Windows machine for you, @luigidolcetti?

luigidolcetti commented 2 years ago

@tlapak Yes, a relatively recent Windows 10

ben-schwen commented 2 years ago

@luigidolcetti Would you mind sharing the output of sessionInfo() with us?

Does the issue still appear if you upgrade to 1.14.3 with data.table::update.dev.pkg()?

luigidolcetti commented 2 years ago

@ben-schwen, with 1.14.3 @tlapak's example works fine on a 12x250000 table, iterated some 20 times. Here is the session info for the PC where I tried it. Sorry, at the moment I cannot access the PC where I first noticed the issue.

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.3

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3

ben-schwen commented 2 years ago

@luigidolcetti could you also update to 1.14.3 on the original PC and retry the original problem there?

luigidolcetti commented 2 years ago

Hi @ben-schwen, so I tried on the first PC with my original dataset, and unfortunately it seems that the problem persists even with version 1.14.3. Here is my sessionInfo():

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] RUNIMCTEMP_0.5.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 rstudioapi_0.13 raster_3.4-13 BiocGenerics_0.38.0 munsell_0.5.0
[6] colorspace_2.0-1 flowCore_2.4.0 lattice_0.20-44 R6_2.5.0 rlang_0.4.11
[11] tools_4.1.0 parallel_4.1.0 grid_4.1.0 Biobase_2.52.0 data.table_1.14.3
[16] matrixStats_0.59.0 digest_0.6.27 RcppParallel_5.1.4 randomForest_4.6-14 lifecycle_1.0.0
[21] crayon_1.4.1 cytolib_2.4.0 RProtoBufLib_2.4.0 S4Vectors_0.30.0 codetools_0.2-18
[26] ncdf4_1.17 sp_1.4-5 compiler_4.1.0 scales_1.1.1 stats4_4.1.0

With nThread = 1 I get no problems, but with any other value there are differences. For example, with nThread = 10 and a table like this:

'data.frame': 250000 obs. of 54 variables:
 $ Start_push : num 0 0 0 0 0 0 0 0 0 0 ...
 $ End_push : num 0 0 0 0 0 0 0 0 0 0 ...
 $ Pushes_duration : num 0 0 0 0 0 0 0 0 0 0 ...
 $ X : num 0 1 2 3 4 5 6 7 8 9 ...
 $ Y : num 0 0 0 0 0 0 0 0 0 0 ...
 $ Z : num 0 1 2 3 4 5 6 7 8 9 ...
 $ 120Sn(Sn120Di) : num 0 1 0 0 3.64 ...
 ....

I get 91 discrepancies like:

sprintf("%.60f", T1[32096,7])
[1] "2.414797999999999777998027639114297926425933837890625000000000"
sprintf("%.60f", T2[32096,7])
[1] "2.414798000000000222087237489176914095878601074218750000000000"
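
To put the size of such a discrepancy in perspective, here is a small base-R sketch. The literals are the two values above rounded to 17 significant digits, and the `ulp` formula assumes a normal (non-subnormal) double.

```r
x1 <- 2.4147979999999998
x2 <- 2.4147980000000002

# Bit-for-bit comparison fails...
print(identical(x1, x2))
# ...while a tolerance-based comparison treats the values as equal.
print(isTRUE(all.equal(x1, x2)))

# Spacing between neighbouring doubles around x1 (one unit in the last place).
ulp <- 2^(floor(log2(abs(x1))) - 52)
print((x2 - x1) / ulp)
```

If exact reproducibility is not required downstream, comparing two reads with all.equal() instead of identical() sidesteps the problem, though it does not explain it.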

tlapak commented 2 years ago

I can confirm that if you use my example with 2.414798 instead it shows the differing results on 1.14.3. Bizarrely, for me, it does not on 1.14.2. Still only on Windows and not on Ubuntu.

tlapak commented 2 years ago

@luigidolcetti could you test again with R 4.2 and a current build of data.table 1.14.3? I can no longer reproduce any issues since upgrading to R 4.2.

If running update.dev.pkg() doesn't work, you can download a compatible build here (Note this is a build produced by our CI process. I'm just not sure update.dev.pkg() will download the correct version. It has never worked for me at all tbh...). Alternatively, you can of course build it yourself, but make sure to install rtools42.

I strongly suspect that there was an issue with the compiler/OpenMP in the toolchain, which has now moved from gcc 8 to gcc 10.