I'm trying to load a really large dataset of tweets into MonetDB using MonetDBLite. I have already preprocessed the data: over 500M rows and 123 columns, a mixture of string, double, logical and date columns. The data are stored in chunks as .rds files, and the chunks bind_rows() just fine, indicating that column names and column types are uniform across chunks. However, when I put the data into MonetDB, I realised it gets corrupted. For instance, id_str holds tweet IDs as strings, and they should all have the same nchar() across all tweets. When I count the nchar() values before loading the data into MonetDB, everything looks fine:
> temp_data %>%
+ select(id_str) %>%
+ mutate(id_nchar=nchar(id_str)) %>%
+ count(id_nchar)
# A tibble: 1 x 2
id_nchar n
<int> <int>
1 18 34403
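For context, each chunk is read in roughly like this before being written to the database (the paths below are placeholders for my actual layout), and binding a few chunks together with bind_rows() works without any name or type mismatches:
# placeholder paths; the real chunk files live elsewhere
chunk_files <- list.files("~/data/tweet_chunks", pattern = "\\.rds$", full.names = TRUE)
# a single chunk, the one used in the nchar() count above
temp_data <- readRDS(chunk_files[[1]])
# binding several chunks works fine, so column names and types line up
combined <- dplyr::bind_rows(lapply(chunk_files[1:5], readRDS))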
However, when I run the same count after writing the data to MonetDB, the data come back corrupted:
dbdir <- "~/data/monetdb_identity_dataset"  # directory of the Monet database, should be empty
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), dbdir)
dbWriteTable(con, "identity_dataset", temp_data, append = TRUE)
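To rule out an incomplete write, I also do a quick row-count sanity check after appending each chunk, along these lines:
# quick sanity check that the append finished; the total should grow by
# nrow(temp_data) after each chunk
DBI::dbGetQuery(con, "SELECT COUNT(*) FROM identity_dataset")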
> dplyr::tbl(con, "identity_dataset") %>%
+ select(id_str) %>%
+ mutate(id_nchar=nchar(id_str)) %>%
+ count(id_nchar)
# Source: lazy query [?? x 2]
# Database: MonetDBEmbeddedConnection
id_nchar n
<int> <dbl>
1 18 32245
2 0 2088
3 NA 62
4 9 1
5 15 1
6 2 1
7 8 1
8 14 1
9 1 1
10 7 1
Seeing this, I decided to inspect id_str directly to see whether everything is OK, but unfortunately I get this, which is very strange:
> dplyr::tbl(con, "identity_dataset") %>%
+ select(id_str) %>%
+ mutate(id_nchar=nchar(id_str)) %>%
+ arrange(id_nchar)
# Source: lazy query [?? x 2]
# Database: MonetDBEmbeddedConnection
# Ordered by: id_nchar
id_str id_nchar
<chr> <int>
1 "\xe0\x8e" NA
2 "\xe0\x8e" NA
3 "\xe0\x8e" NA
4 "\xe0\x8e" NA
5 "\xe0\x8e" NA
6 "\xe0\x8e" NA
7 "\xe0\x8e" NA
8 "\xe0\x8e" NA
9 "\xe0\x8e" NA
10 "\xe0\x8e" NA
# ... with more rows
This is very puzzling, because I checked the source .rds files and these strange byte sequences definitely do not appear in the id_str column there. They only appear after I write the data to MonetDB and query it back.
Is there anything I can do to debug this problem? Some columns (like tweet_text and user_description) can contain newlines, commas, emoji, etc., and, drawing on my prior experience with various CSV parsers, I am not sure whether MonetDB's handling of these special characters might be causing the issue. Or could this be an encoding problem? I would really like to use MonetDB for this project and hope there is a simple solution I am missing.
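In case it is an encoding issue, this is roughly what I plan to try next: checking the declared encodings in the source chunk and forcing the character columns to UTF-8 before the write. I have not verified that this addresses the corruption, so it is only a guess:
# what encodings does R think the source strings have?
table(Encoding(temp_data$id_str))
all(validUTF8(temp_data$id_str))
# force all character columns to UTF-8 before dbWriteTable(), in case
# mixed or unknown encodings are what trips up MonetDBLite
temp_data <- dplyr::mutate_if(temp_data, is.character, enc2utf8)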
Adding sessioninfo::session_info()