englianhu / binary.com-interview-question-data

Exchange-rate database for the Deriv statistical modelling research project
https://gitee.com/englianhu
GNU General Public License v3.0
0 stars, 2 forks

Confusing `POSIXlt` Warning !!! #4

Open englianhu opened 1 year ago

englianhu commented 1 year ago

Issue

✖ 3.6 GiB [世博量化研究院*]❯ 样本
[data.table]: 
# A tibble: 1,324,800 × 12
   年月日时分           年份  季度  月份    周 周日  周分计 日分计 时分计  序列 日期      
   <dttm>              <dbl> <int> <int> <dbl> <chr>  <int>  <int>  <int> <int> <date>    
 1 2015-01-05 00:01:00  2015     1     1     1 周一       1      1      1     1 2015-01-05
 2 2015-01-05 00:02:00  2015     1     1     1 周一       2      2      2     2 2015-01-05
 3 2015-01-05 00:03:00  2015     1     1     1 周一       3      3      3     3 2015-01-05
 4 2015-01-05 00:04:00  2015     1     1     1 周一       4      4      4     4 2015-01-05
 5 2015-01-05 00:05:00  2015     1     1     1 周一       5      5      5     5 2015-01-05
 6 2015-01-05 00:06:00  2015     1     1     1 周一       6      6      6     6 2015-01-05
 7 2015-01-05 00:07:00  2015     1     1     1 周一       7      7      7     7 2015-01-05
 8 2015-01-05 00:08:00  2015     1     1     1 周一       8      8      8     8 2015-01-05
 9 2015-01-05 00:09:00  2015     1     1     1 周一       9      9      9     9 2015-01-05
10 2015-01-05 00:10:00  2015     1     1     1 周一      10     10     10    10 2015-01-05
# … with 1,324,790 more rows, and 1 more variable: 闭市价 <dbl>
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Warning messages:
1: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
2: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
3: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
4: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
5: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
6: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
7: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
8: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
9: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
10: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
✖ 3.6 GiB [世博量化研究院*]❯ head(样本$年月日时分)
[1] "2015-01-05 00:01:00 CST" "2015-01-05 00:02:00 CST" "2015-01-05 00:03:00 CST"
[4] "2015-01-05 00:04:00 CST" "2015-01-05 00:05:00 CST" "2015-01-05 00:06:00 CST"
✖ 3.6 GiB [世博量化研究院*]❯ class(样本$年月日时分)
[1] "POSIXct" "POSIXt"
✖ 3.6 GiB [世博量化研究院*]❯ anyNA(样本$年月日时分)
[1] FALSE
✖ 3.6 GiB [世博量化研究院*]❯ anyNA.POSIXlt(样本$年月日时分)
[1] FALSE
✖ 3.6 GiB [世博量化研究院*]❯ anyDuplicated.data.frame(样本)
[1] 0
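
For anyone trying to narrow this down, here is a minimal base-R sketch of a few extra checks on a `POSIXct` vector, beyond `anyNA()`. The helper name `diagnose_posixct` and the sample value are assumptions made up for illustration; they are not part of the original session, and nothing here is confirmed to explain the warning.

```r
# Hypothetical helper (not from the original session): summarise how a
# POSIXct vector is stored, to look for values that could upset formatting.
diagnose_posixct <- function(x) {
  secs <- as.numeric(x)  # seconds since 1970-01-01 UTC
  list(
    storage     = typeof(x),                # underlying storage type
    tzone       = attr(x, "tzone"),         # time zone attribute, if any
    n_na        = sum(is.na(x)),            # missing timestamps
    n_nonfinite = sum(!is.finite(secs)),    # NA, NaN, Inf or -Inf values
    range_secs  = range(secs, na.rm = TRUE) # smallest and largest value
  )
}

# Made-up value shaped like the first row of the printout above.
x <- as.POSIXct("2015-01-05 00:01:00", tz = "Asia/Shanghai")
str(diagnose_posixct(x))
```

Running the same helper on `样本$年月日时分` would show whether the column holds any non-finite or unexpectedly large values.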

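Qin Dynasty Epic (《大秦赋》): Sorrow comes from the sorcerers and cannot be cut off; what can dispel it? Only driving out the sorcerers. The people of Qin herded horses, starting between the Qian and Wei rivers; with the sorcerers' descendants all cast out, the clay pots rumble like thunder.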

For the record: the time-series issue is shown above, and a related case is quoted below.

OK so what's happening is that the evaluation environment of `j` has `strptime` overwritten locally:

https://github.com/Rdatatable/data.table/blob/a8e926a48a87cd669ffe2ee310a73173be652f2b/R/data.table.R#L1151-L1154

From there, it doesn't discriminate on whether `strptime` is operating on or producing a column. I don't think there's any easy fix to be more selective about this warning, but the message could be more helpful.

Note that AFAIK `strptime` can always be replaced by an `as.POSIXct` call (which wraps `as.POSIXlt` --> `strptime` anyway), in which case `j` will be ignorant of `strptime` being called "under the hood" (since the call chain will end up at `base::as.POSIXct`, so `base::strptime` is used, not `SDenv$strptime`).
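
To make the quoted suggestion concrete, the sketch below compares `strptime` and `as.POSIXct` in plain base R, outside of `data.table`. The sample strings and the variable names `x`, `fmt`, `tz`, `lt` and `ct` are assumptions chosen for illustration.

```r
# Made-up timestamps in the same shape as the column printed in this issue.
x   <- c("2015-01-05 00:01:00", "2015-01-05 00:02:00")
fmt <- "%Y-%m-%d %H:%M:%S"
tz  <- "Asia/Shanghai"

lt <- strptime(x, format = fmt, tz = tz)    # returns a POSIXlt object
ct <- as.POSIXct(x, format = fmt, tz = tz)  # returns a POSIXct vector

class(lt)
class(ct)

# Both routes describe the same instants in time.
all(as.POSIXct(lt) == ct)
```

If the quoted analysis applies here, writing the conversion as `as.POSIXct(...)` inside `j` is the form it recommends, because the column is then stored as `POSIXct` rather than `POSIXlt`.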

Operating system

✖ 3.6 GiB [世博量化研究院*]❯ session_info()$platform
 setting  value
 version  R version 4.2.2 (2022-10-31)
 os       RedFlag Desktop 11.0
 system   x86_64, linux-gnu
 ui       RStudio
 language zh_CN:en
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Asia/Shanghai
 date     2022-12-29
 rstudio  2022.12.0+353 Elsbeth Geranium (desktop)
 pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
✖ 3.6 GiB [世博量化研究院*]❯ sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: RedFlag Desktop 11.0

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
...
...

Related resources:

englianhu commented 1 year ago
# --------- eval = FALSE ---------
library(stringr)  # provides str_split()

## Check whether the path has already been set.
if (!exists('.蜀道')) {
  .蜀道 <- getwd() |> 
    {\(.) str_split(., '/')}() |> 
    {\(.) c('/', .[[1]][2:5])}() |> 
    {\(.) c(., 'binary.com-interview-question-data/')}() |> 
    {\(.) paste(., collapse = '/')}() |> 
    {\(.) substring(., 2)}()
}

if (!exists('.蜀道仓库')) {
  .蜀道仓库 <- paste0(.蜀道, '文艺数据库/fx/USDJPY/仓库/')
}

## If the data is not yet in the environment, read it from file.
if (!exists('样本')) {
  样本 <- readRDS(paste0(.蜀道, '文艺数据库/fx/USDJPY/样本1.rds'))
}

✖ 3.6 GiB [世博量化研究院*]❯ 样本 %>% filter(is.na(闭市价))
Source: local data table [7,200 x 12]
Call:   `_DT2`[is.na(闭市价)]

  年月日时分           年份  季度  月份    周 周日  周分计 日分计 时分计    序列 日期      
  <dttm>              <dbl> <int> <int> <dbl> <chr>  <int>  <int>  <int>   <int> <date>    
1 2018-01-02 00:01:00  2017     1     1    53 周二       1      1      1 1123201 2018-01-02
2 2018-01-02 00:02:00  2017     1     1    53 周二       3      3      3 1123203 2018-01-02
3 2018-01-02 00:03:00  2017     1     1    53 周二       5      5      5 1123205 2018-01-02
4 2018-01-02 00:04:00  2017     1     1    53 周二       7      7      7 1123207 2018-01-02
5 2018-01-02 00:05:00  2017     1     1    53 周二       9      9      9 1123209 2018-01-02
6 2018-01-02 00:06:00  2017     1     1    53 周二      11     11     11 1123211 2018-01-02
# … with 7,194 more rows, and 1 more variable: 闭市价 <dbl>
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

# Use as.data.table()/as.data.frame()/as_tibble() to access results
Warning messages:
1: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
2: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
3: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
4: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
5: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
6: In format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) :
  NAs introduced by coercion to integer range
✖ 3.6 GiB [世博量化研究院*]❯ 样本 %>% filter(is.na(闭市价)) %>% data.frame
           年月日时分 年份 季度 月份 周 周日 周分计 日分计 时分计    序列       日期 闭市价
1 2018-01-02 00:01:00 2017    1    1 53 周二      1      1      1 1123201 2018-01-02     NA
2 2018-01-02 00:02:00 2017    1    1 53 周二      3      3      3 1123203 2018-01-02     NA
3 2018-01-02 00:03:00 2017    1    1 53 周二      5      5      5 1123205 2018-01-02     NA
4 2018-01-02 00:04:00 2017    1    1 53 周二      7      7      7 1123207 2018-01-02     NA
5 2018-01-02 00:05:00 2017    1    1 53 周二      9      9      9 1123209 2018-01-02     NA
6 2018-01-02 00:06:00 2017    1    1 53 周二     11     11     11 1123211 2018-01-02     NA
7 2018-01-02 00:07:00 2017    1    1 53 周二     13     13     13 1123213 2018-01-02     NA
8 2018-01-02 00:08:00 2017    1    1 53 周二     15     15     15 1123215 2018-01-02     NA
 [ reached 'max' / getOption("max.print") -- omitted 7192 rows ]
✖ 3.6 GiB [世博量化研究院*]❯ 样本 %>% filter(is.na(闭市价)) %>% data.frame %>% tail
              年月日时分 年份 季度 月份 周 周日 周分计 日分计 时分计    序列       日期 闭市价
7195 2018-01-06 23:55:00 2017    1    1 53 周六   7189   1429     49 1137589 2018-01-06     NA
7196 2018-01-06 23:56:00 2017    1    1 53 周六   7191   1431     51 1137591 2018-01-06     NA
7197 2018-01-06 23:57:00 2017    1    1 53 周六   7193   1433     53 1137593 2018-01-06     NA
7198 2018-01-06 23:58:00 2017    1    1 53 周六   7195   1435     55 1137595 2018-01-06     NA
7199 2018-01-06 23:59:00 2017    1    1 53 周六   7197   1437     57 1137597 2018-01-06     NA
7200 2018-01-07 00:00:00 2017    1    1 53 周日   7199   1439     59 1137599 2018-01-07     NA
englianhu commented 1 year ago

Hacker intrusion, a man-made cause: both `样本$年月日时分 %<>% ymd_hms(tz = 'Asia/Shanghai')` and `样本$年月日时分 %<>% as_datetime` still trigger the warning, while `样本[, 年月日时分 := format(年月日时分, '%Y-%m-%d %H:%M:%S', tz = 'Asia/Shanghai', usetz = TRUE)]` turns the date-time column into text.
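
The text conversion mentioned above can be reproduced in base R. This is a minimal sketch with a single made-up timestamp; the variable names `ct`, `txt` and `back` are assumptions. It shows that `format()` always returns character, and that the string can be parsed back if a date-time column is still needed.

```r
fmt <- "%Y-%m-%d %H:%M:%S"
tz  <- "Asia/Shanghai"

# Made-up timestamp in the same shape as the column in this issue.
ct <- as.POSIXct("2018-01-02 00:01:00", format = fmt, tz = tz)

# format() produces text, which is why the column stops being a date-time.
txt <- format(ct, fmt, tz = tz, usetz = TRUE)
class(txt)

# Parsing the text again restores a POSIXct value; characters after the
# part matched by `fmt` (here the time zone abbreviation) are ignored.
back <- as.POSIXct(txt, format = fmt, tz = tz)
back == ct
```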

englianhu commented 1 year ago

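A side note: the data to use should be `样本2`, in which the NA values have been filtered out and columns such as `周分计`, `日分计`, `时分计` and `序列` have been reassigned, rather than `样本1`.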
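
As a rough illustration of that kind of cleanup, the base-R sketch below drops rows with a missing close price and rebuilds a running sequence number. The data frame `bars` and its columns `timestamp`, `close` and `seq` are made-up stand-ins, not the real `样本1`, and the minute counters are left out because their exact definitions are not given in this thread.

```r
# Made-up one-minute bars; two of them have no closing price.
bars <- data.frame(
  timestamp = as.POSIXct("2018-01-02 00:00:00", tz = "Asia/Shanghai") + 60 * (1:5),
  close     = c(112.3, NA, 112.4, NA, 112.5)
)

# Keep only rows with a closing price, then renumber the sequence column.
clean <- bars[!is.na(bars$close), ]
clean$seq <- seq_len(nrow(clean))
rownames(clean) <- NULL

clean
```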