tonofshell opened this issue 5 years ago
What about if you try
library(keras)
library(readr)

max_words <- 10000
max_length <- 100

# import training data
train <- read_csv("data/congress_train.csv")

# create tokenizer based on training data
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(train$Title)

# convert titles to integer sequences and pad to a common length
x_train <- texts_to_sequences(tokenizer, train$Title) %>%
  pad_sequences(maxlen = max_length)
This uses text_tokenizer() and fit_text_tokenizer() in a single pipeline to both initialize the tokenizer and establish its dictionary. You can then use the fitted tokenizer to convert the bill titles in train into integer sequences and perform the necessary padding.
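As a quick sanity check once the pipeline finishes, you can inspect the fitted tokenizer and the padded matrix (a minimal sketch, assuming the tokenizer, x_train, and max_length objects from the snippet above):

# number of distinct tokens the tokenizer learned during fitting
length(tokenizer$word_index)

# the padded matrix should have one row per title and max_length columns
dim(x_train)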
Also, are you running this on your own computer or on a server in the cloud? Given the size of the dataset, a regular laptop may not be able to handle the tokenization. You can check by first subsetting train to just 1,000 or 10,000 rows and tokenizing that. If it works on a subset of train but not on the full dataset, you probably just need more computational resources to process it.
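For instance, something like this (a minimal sketch; it assumes the train tibble and max_words value from the snippet above, and the 10,000-row cutoff is arbitrary):

library(keras)

# try the tokenizer on a small slice of the training data first
train_small <- head(train, 10000)

tokenizer_small <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(train_small$Title)

# if this succeeds, increase the number of rows until you find where it fails
length(tokenizer_small$word_index)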
It does indeed appear to be some computational constraint, or perhaps a bug; I'm not sure which. I can get up to a subset of about 270,000 lines before the crashing starts again. However, nothing concerning happens with resource utilization before the crash: I still have about 5 GB of RAM free and CPU utilization is at about 20%. At 270,000 lines, RStudio is using about 1.7 GB of RAM, so there's still plenty of headroom. Even after restarting and closing a bunch of programs to free up RAM, it still crashes on the full dataset.
Yes, the same thing happened to me when I attempted to tokenize on my computer. You may need to perform that step using some sort of cloud resource. My guess is that even if you get the dataset tokenized on your computer, you'll have significant difficulty estimating the deep learning models in a timely fashion due to their overall complexity.
Well, I have a GPU set up for estimating the models, which is what I ran the previous assignment on. When I tried to use Google Cloud, it would run but would error out when retrieving the results. I'm attempting to set up an EC2 instance now, but it is also giving me errors saying that my requests for resources need to be validated. Hopefully something will start working sooner rather than later.
Hmm. Maybe you could use Google Cloud just to tokenize the dataset? Export the padded sequences as .Rds objects, then download them to your computer and import them directly.
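Something along these lines should work (a minimal sketch, assuming the x_train object from the earlier snippet; the file name is just an example):

# on the cloud instance: save the padded sequences to disk
saveRDS(x_train, file = "x_train.Rds")

# locally, after downloading the file: read the sequences back in
x_train <- readRDS("x_train.Rds")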
That is a good idea. I'll try that if Amazon doesn't sort itself out. Thank you for all the help!
I'm having the same issues using RStudio Server on a p2.xlarge EC2 instance. I can tokenize about 270,000 lines, but running the full dataset only gives weird hangs and errors. When I tried to knit the full R Markdown document, I did get an error this time, though, which I've copied below:
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: .Call(`_reticulate_py_call_impl`, x, args, keywords)
2: py_call_impl(callable, dots$args, dots$keywords)
3: tokenizer$fit_on_texts(if (is.function(x)) reticulate::py_iterator(x) else as_texts(x))
4: fit_text_tokenizer(., input_train)
5: function_list[[k]](value)
6: withVisible(function_list[[k]](value))
7: freduce(value, `_function_list`)
8: `_fseq`(`_lhs`)
9: eval(quote(`_fseq`(`_lhs`)), env, env)
10: eval(quote(`_fseq`(`_lhs`)), env, env)
11: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
12: text_tokenizer(num_words = max_words) %>% fit_text_tokenizer(input_train)
13: eval(expr, envir, enclos)
14: eval(expr, envir, enclos)
15: withVisible(eval(expr, envir, enclos))
16: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)
17: handle(ev <- withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler))
18: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)))
19: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos, debug = debug, last = i == length(out), use_try = stop_on_error != 2L, keep_warning = keep_warning, keep_message = keep_message, output_handler = output_handler, include_timing = include_timing)
20: evaluate::evaluate(...)
21: evaluate(code, envir = env, new_device = FALSE, keep_warning = !isFALSE(options$warning), keep_message = !isFALSE(options$message), stop_on_error = if (options$error && options$include) 0L else 2L, output_handler = knit_handlers(options$render, options))
22: in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE, keep_warning = !isFALSE(options$warning), keep_message = !isFALSE(options$message), stop_on_error = if (options$error && options$include) 0L else 2L, output_handler = knit_handlers(options$render, options)))
23: block_exec(params)
24: call_block(x)
25: process_group.block(group)
26: process_group(group)
27: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group), error = function(e) { setwd(wd) cat(res, sep = "\n", file = output %n% "") message("Quitting from lines ", paste(current_lines(i), collapse = "-"), " (", knit_concord$get("infile"), ") ") })
28: process_file(text, output)
29: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet, encoding = encoding)
30: rmarkdown::render("/home/adam/hw02/hw02.Rmd", encoding = "UTF-8")
An irrecoverable exception occurred. R is aborting now ...
I'm getting the same error message for the testing set on my EC2 instance, even though it has significantly fewer lines. I'm just going to submit whatever I can get to run.
I'm having a strange issue with Homework 2. When I try to tokenize the bill titles from the training set, it appears to run for about a minute but then crashes R with a message that R needed to abort. If I try to knit the R Markdown document, it just hangs forever at the block containing the tokenizer code. It does not crash if I try to tokenize the whole tibble instead of just the vector of titles, but that obviously just results in an empty tokenizer. I've experienced these issues on two machines, both running RStudio 1.2.1335 on Windows 10, but each with a different version of R: 3.5.3 and 3.6.0. All of my latest code is up on my repo, but I've copied the pertinent section below: