css-research / hw02

Issues Tokenizing for Keras in R #6

Open tonofshell opened 5 years ago

tonofshell commented 5 years ago

I’m having a strange issue with Homework 2. When I try to tokenize the bill titles from the training set, it appears to run for about a minute but then crashes R with a message that R needed to abort. If I try to knit the RMarkdown document, it just hangs forever at the block containing the tokenizer code. It does not crash if I try to tokenize the whole tibble instead of just the vector of titles, but that obviously just results in an empty tokenizer. I’ve experienced these issues on two machines both running RStudio 1.2.1335 on Windows 10, but each had different versions of R: 3.5.3 and 3.6.0. All of my latest code is up on my repo but I’ve copied the pertinent section below:

```{r prepare-data, cache=TRUE}
max_words = 10000
max_length = 100
input_train = congress_train$Title
tokenizer = text_tokenizer(num_words = max_words)
tokenizer %>% fit_text_tokenizer(input_train) # appears to freeze/crash here
sequences = texts_to_sequences(tokenizer, input_train)
word_indices = tokenizer$word_index
word_counts = tokenizer$word_counts
data = pad_sequences(sequences, max_length)
```

bensoltoff commented 5 years ago

What about if you try

```r
library(keras)
library(readr)

max_words <- 10000
max_length <- 100

# import training data
train <- read_csv("data/congress_train.csv")

# create tokenizer based on training data
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(train$Title)

# extract sequences, then pad to a fixed length
x_train <- texts_to_sequences(tokenizer, train$Title) %>%
  pad_sequences(maxlen = max_length)
```

This uses text_tokenizer() and fit_text_tokenizer() in a single pipeline to both initialize the tokenizer and build its dictionary. You can then use the fitted tokenizer to convert the bill titles in train to integer sequences and apply the necessary padding.

Also, are you running this on your own computer or a server in the cloud? Given the size of the dataset, a regular laptop may not be able to handle the tokenizing operation. You can check by first subsetting train to just 1,000 or 10,000 rows and tokenizing that instead. If it works on a subset of train but not on the full dataset, you probably just need more computational resources.
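
A minimal sketch of that check (assuming `train`, `max_words`, and the keras setup from the snippet above; `n_sub` is just an illustrative name):

```r
library(keras)
library(dplyr)

# try the tokenizer on a small subset first
n_sub <- 10000
train_sub <- slice(train, 1:n_sub)

tokenizer_sub <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(train_sub$Title)

length(tokenizer_sub$word_index) # if this completes, increase n_sub and retry
```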

tonofshell commented 5 years ago

It does indeed appear to be some computational constraint, or perhaps a bug; I'm not sure which. I can get up to a subset of about 270,000 lines before the crashing issue begins again. However, there's nothing concerning happening with resource utilization before the crash: I still have about 5 GB of RAM free, and CPU utilization is at about 20%. At 270,000 lines, RStudio is using about 1.7 GB of RAM, so there's still plenty of headroom. Even after restarting and closing out a bunch of programs to free up RAM, it still crashes on the full dataset.
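
One quick sanity check, for what it's worth, is how much memory the raw title vector itself takes up (a base-R sketch, using `congress_train` from my original snippet):

```r
# size of the character vector being passed to the tokenizer
format(object.size(congress_train$Title), units = "MB")
```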

bensoltoff commented 5 years ago

Yes, that happened to me as well when I attempted to tokenize on my computer. You may need to perform that step using some sort of cloud resource. My guess is that even if you get the dataset tokenized on your computer, you'll have significant difficulty estimating the deep learning models in a timely fashion due to their overall complexity.

tonofshell commented 5 years ago

Well, I have a GPU set up for computing the models, which is what I ran the previous assignment on. When I tried to use Google Cloud, it would run but would error out when retrieving the results. I'm attempting to set up an EC2 instance now, but it is also giving me errors saying that my requests for resources need to be validated. Hopefully something will start working sooner rather than later.

bensoltoff commented 5 years ago

Hmm. Maybe you could use Google Cloud just to tokenize the dataset? Export the padded sequences as .Rds objects, then download them to your computer and import them directly.
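
A rough sketch of that workflow (file name is just a placeholder):

```r
library(keras)

# on the cloud instance: tokenize, pad, and export the result
x_train <- texts_to_sequences(tokenizer, train$Title) %>%
  pad_sequences(maxlen = max_length)
saveRDS(x_train, "x_train.rds")

# locally, after downloading the file: import the padded sequences directly
x_train <- readRDS("x_train.rds")
```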

tonofshell commented 5 years ago

That is a good idea. I'll try that if Amazon doesn't sort itself out. Thank you for all the help!

tonofshell commented 5 years ago

I'm having the same issues using RStudio Server on a p2.xlarge EC2 instance. I can tokenize about 270,000 lines, but running the full dataset only gives weird hangs and errors. When I tried to knit the full R Markdown document, I did get an error this time, which I've copied below:

```
*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
1: .Call(`_reticulate_py_call_impl`, x, args, keywords)
2: py_call_impl(callable, dots$args, dots$keywords)
3: tokenizer$fit_on_texts(if (is.function(x)) reticulate::py_iterator(x) else as_texts(x))
4: fit_text_tokenizer(., input_train)
5: function_list[[k]](value)
6: withVisible(function_list[[k]](value))
7: freduce(value, `_function_list`)
8: `_fseq`(`_lhs`)
9: eval(quote(`_fseq`(`_lhs`)), env, env)
10: eval(quote(`_fseq`(`_lhs`)), env, env)
11: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
12: text_tokenizer(num_words = max_words) %>% fit_text_tokenizer(input_train)
13: eval(expr, envir, enclos)
14: eval(expr, envir, enclos)
15: withVisible(eval(expr, envir, enclos))
16: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler,     error = eHandler, message = mHandler)
17: handle(ev <- withCallingHandlers(withVisible(eval(expr, envir,     enclos)), warning = wHandler, error = eHandler, message = mHandler))
18: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval(expr,     envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)))
19: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos,     debug = debug, last = i == length(out), use_try = stop_on_error !=         2L, keep_warning = keep_warning, keep_message = keep_message,     output_handler = output_handler, include_timing = include_timing)
20: evaluate::evaluate(...)
21: evaluate(code, envir = env, new_device = FALSE, keep_warning = !isFALSE(options$warning),     keep_message = !isFALSE(options$message), stop_on_error = if (options$error &&         options$include) 0L else 2L, output_handler = knit_handlers(options$render,         options))
22: in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE,     keep_warning = !isFALSE(options$warning), keep_message = !isFALSE(options$message),     stop_on_error = if (options$error && options$include) 0L else 2L,     output_handler = knit_handlers(options$render, options)))
23: block_exec(params)
24: call_block(x)
25: process_group.block(group)
26: process_group(group)
27: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group),     error = function(e) {        setwd(wd)        cat(res, sep = "\n", file = output %n% "")        message("Quitting from lines ", paste(current_lines(i),             collapse = "-"), " (", knit_concord$get("infile"),             ") ")    })
28: process_file(text, output)
29: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet,     encoding = encoding)
30: rmarkdown::render("/home/adam/hw02/hw02.Rmd", encoding = "UTF-8")
An irrecoverable exception occurred. R is aborting now ...
```

tonofshell commented 5 years ago

I'm getting the same error message for the testing set in my EC2 instance, even though it has significantly fewer lines. I'm just going to submit whatever I can get to run.