MonetDB / MonetDBLite-R

MonetDB reconfigured as an R package - See below for an introduction.

Data set is getting corrupted when put in MonetDB #33

Open sefabey opened 5 years ago

sefabey commented 5 years ago

I'm trying to load a very large dataset of tweets into MonetDB using MonetDBLite. The data are already preprocessed: over 500M rows and 123 columns, a mixture of string, double, logical and date columns. They are stored in chunks in .rds format, and the chunks combine with bind_rows() without problems, which indicates that the column names and column types are uniform across chunks. However, after writing the data to MonetDB, I realised it gets corrupted. For instance, id_str holds tweet IDs in string format, and they should have the same nchar() across all tweets. When I count nchar() before writing the data to MonetDB, the result is as expected:

> temp_data %>%
+   select(id_str) %>%
+   mutate(id_nchar=nchar(id_str)) %>%
+   count(id_nchar)
# A tibble: 1 x 2
  id_nchar     n
     <int> <int>
1       18 34403

However, when I run the same query after writing the data to MonetDB, the counts show that the column has been corrupted:


dbdir <- "~/data/monetdb_identity_dataset"  # directory of the MonetDB database; should be empty
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), dbdir)
DBI::dbWriteTable(con, "identity_dataset", temp_data, append = TRUE)

> dplyr::tbl(con, "identity_dataset") %>%
+   select(id_str) %>%
+   mutate(id_nchar=nchar(id_str)) %>%
+   count(id_nchar)
# Source:   lazy query [?? x 2]
# Database: MonetDBEmbeddedConnection
   id_nchar     n
      <int> <dbl>
 1       18 32245
 2        0  2088
 3       NA    62
 4        9     1
 5       15     1
 6        2     1
 7        8     1
 8       14     1
 9        1     1
10        7     1

Seeing this, I decided to inspect id_str directly to see whether everything was OK. Unfortunately, I get this, which is very strange:


> dplyr::tbl(con, "identity_dataset") %>%
+   select(id_str) %>%
+   mutate(id_nchar=nchar(id_str)) %>%
+   arrange(id_nchar)
# Source:     lazy query [?? x 2]
# Database:   MonetDBEmbeddedConnection
# Ordered by: id_nchar
   id_str     id_nchar
   <chr>         <int>
 1 "\xe0\x8e"       NA
 2 "\xe0\x8e"       NA
 3 "\xe0\x8e"       NA
 4 "\xe0\x8e"       NA
 5 "\xe0\x8e"       NA
 6 "\xe0\x8e"       NA
 7 "\xe0\x8e"       NA
 8 "\xe0\x8e"       NA
 9 "\xe0\x8e"       NA
10 "\xe0\x8e"       NA
# ... with more rows

This is puzzling: I checked the source .rds files, and these strange bytes certainly do not appear in the id_str column of the source data. They only appear after I write the data to MonetDB and query it.

Is there anything I can do to debug this problem? Certain columns (like tweet_text and user_description) can contain newlines, commas, emojis and so on, and I am not sure whether MonetDB's handling of these special characters might be causing the issue (drawing on my prior experience with multiple csv parsers). Or could this be an encoding issue? I would really like to use MonetDB for this project and hope there is a simple solution I am missing.
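
One check that might help narrow down the encoding question is sketched below. It is a minimal sketch using base R only: check_string_columns() is a hypothetical helper name, not part of MonetDBLite or DBI, and the small example data frame is made up for illustration. For every character column it counts NA values, empty strings, and values that are not valid UTF-8. It assumes the data fit in memory as an ordinary data frame.

```r
# Hypothetical helper (not part of MonetDBLite or DBI): for each character
# column of a data frame, count NA values, empty strings, and values whose
# bytes are not valid UTF-8.
check_string_columns <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  rows <- lapply(names(df)[is_chr], function(col) {
    x <- df[[col]]
    data.frame(
      column = col,
      n_na = sum(is.na(x)),
      # nzchar() returns TRUE for NA by default, so NAs are not counted here
      n_empty = sum(!nzchar(x)),
      n_invalid_utf8 = sum(!is.na(x) & !validUTF8(x)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Made-up example: one good ID, one value with invalid bytes, one empty
# string, and one NA. The integer column is ignored by the helper.
example <- data.frame(
  id_str = c("123456789012345678", "\xe0\x8e", "", NA),
  n = 1:4,
  stringsAsFactors = FALSE
)
check_string_columns(example)
```

Running this on temp_data before dbWriteTable(), and again on a chunk read back with DBI::dbReadTable(), would show whether invalid bytes already exist in some text column of the source or only appear after the round trip.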

Adding sessioninfo::session_info()


> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.5.1 (2018-07-02)
 os       Red Hat Enterprise Linux Server 7.4 (Maipo)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/London
 date     2018-12-29

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.0   2017-04-11 [2] CRAN (R 3.5.1)
 bindr         0.1.1   2018-03-13 [2] CRAN (R 3.5.1)
 bindrcpp    * 0.2.2   2018-03-29 [2] CRAN (R 3.5.1)
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
 codetools     0.2-16  2018-12-24 [1] CRAN (R 3.5.1)
 crayon        1.3.4   2017-09-16 [2] CRAN (R 3.5.1)
 DBI         * 1.0.0   2018-05-02 [2] CRAN (R 3.5.1)
 dbplyr        1.2.2   2018-07-25 [1] CRAN (R 3.5.1)
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
 dplyr       * 0.7.8   2018-11-10 [1] CRAN (R 3.5.1)
 fansi         0.4.0   2018-10-05 [1] CRAN (R 3.5.1)
 furrr       * 0.1.0   2018-05-16 [1] CRAN (R 3.5.1)
 future      * 1.10.0  2018-10-17 [1] CRAN (R 3.5.1)
 globals       0.12.4  2018-10-11 [1] CRAN (R 3.5.1)
 glue          1.3.0   2018-07-17 [2] CRAN (R 3.5.1)
 hms           0.4.2   2018-03-10 [1] CRAN (R 3.5.1)
 janitor       1.1.1   2018-07-31 [1] CRAN (R 3.5.1)
 listenv       0.7.0   2018-01-21 [1] CRAN (R 3.5.1)
 lubridate   * 1.7.4   2018-04-11 [1] CRAN (R 3.5.1)
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.5.1)
 MonetDBLite * 0.6.0   2018-07-27 [1] CRAN (R 3.5.1)
 pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.1)
 pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
 purrr         0.2.5   2018-05-29 [2] CRAN (R 3.5.1)
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.1)
 readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.5.1)
 rlang         0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
 snakecase     0.9.2   2018-08-14 [1] CRAN (R 3.5.1)
 stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
 stringr     * 1.3.1   2018-05-10 [2] CRAN (R 3.5.1)
 tibble        1.4.2   2018-01-22 [2] CRAN (R 3.5.1)
 tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
 utf8          1.1.4   2018-05-24 [2] CRAN (R 3.5.1)
 withr         2.1.2   2018-03-15 [2] CRAN (R 3.5.1)