Open rockroland opened 3 years ago
I wanted to follow up and say that I did finally get it to work.
However, I did make some changes that may be useful to others so I am posting it here.
My narrow objective was to just get a case-sensitive word corpus. I wanted a simple csv output with 1 line per word and the total counts for that word across all years like this:
"Abc", 176198
"Xyz", 789
"def", 79806
So I modified your original get_data.cpp so that it disregards the years, sums the counts across all years, and gives me 1 line per word along with the total count for that word.
This made the final output file much smaller and just gave me a "dictionary corpus" with word frequencies. For one of the Google ngram files (~3.5 million ngrams in a typical file) it took just a few seconds to process. For a typical 2020 1-gram file that started out at 1.7GB, the resulting word-count corpus output file was 55MB. A snippet of output for one of the files looks like this:
"word","wordcount"
"Araire",152
"A.R.Ubbelohde",200
"Andcortes_NOUN",75
"Abom",6631
"Anrrich_NOUN",593
"Aosc",475
"Aboso",3760
"",146
"890285",132
"Akwal_NOUN",363
"Apagira",71
I attached a new .cpp file called get_data_counts.cpp.
The complete steps to use this would be:
Install R
Install RStudio (optional)
Install the Rcpp package
Install the Rtools package
Install the BH package (Boost)
Set the R working directory
Put the raw ngram files in that working directory
Put the get_data_counts.cpp file in that working directory
Then at the R command prompt (>) type these commands in order:
library(BH)
library(Rcpp)
sourceCpp("get_data_counts.cpp")
ngrams=get_data("put_an_ngram_filename_here")
write.csv(ngrams, "put_output_filename_here.csv", row.names=FALSE)
Some things to note are:
When you run sourceCpp("get_data_counts.cpp") to compile the function, you will see some warnings from the Boost library, which can be ignored.
Even though the .cpp file is called get_data_counts, the function it creates is still called get_data.
The row.names=FALSE option makes a CSV without line numbers.
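The per-word summing described in the comment above can be sketched in plain C++. This is a minimal standalone illustration, not the attached get_data_counts.cpp, which is built on Rcpp and Boost. It assumes each line of a 2020 ngram file holds the ngram followed by tab-separated year,match_count,volume_count fields; the names sum_line and count_words are hypothetical.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Sum match_count over every "year,match_count,volume_count" field on one line.
// Assumed layout: ngram TAB year,match,volumes TAB year,match,volumes ...
// Returns false when the line carries no well-formed year field.
// Note: std::stoull throws if a match_count is not a number.
bool sum_line(const std::string& line, std::string& word, std::uint64_t& total) {
    std::istringstream fields(line);
    if (!std::getline(fields, word, '\t')) return false;

    total = 0;
    bool any = false;
    std::string field;
    while (std::getline(fields, field, '\t')) {
        const std::size_t first = field.find(',');
        if (first == std::string::npos) continue;
        const std::size_t second = field.find(',', first + 1);
        if (second == std::string::npos) continue;
        // The text between the two commas is the match_count for that year.
        total += std::stoull(field.substr(first + 1, second - first - 1));
        any = true;
    }
    return any;
}

// Read ngram lines from a stream and accumulate one total per word.
// Keys are compared byte for byte, so "Abc" and "abc" stay separate.
std::unordered_map<std::string, std::uint64_t> count_words(std::istream& in) {
    std::unordered_map<std::string, std::uint64_t> totals;
    std::string line, word;
    std::uint64_t total = 0;
    while (std::getline(in, line)) {
        if (sum_line(line, word, total)) totals[word] += total;
    }
    return totals;
}
```

Passing an open std::ifstream for one uncompressed ngram file to count_words would give the per-word totals; writing them out as quoted CSV rows is left to the caller, as write.csv does in the R steps above.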
Glad you got it to work.
Yes, I could document it better. As an R user, I tend to forget all the things I had to install to run the code. Looking at the readme, I did forget to add the BH dependency.
I'm not sure why you needed the rtools package. Perhaps that's an Rcpp dependency?
Anyway, thanks for the notes. I'll put it on my todo list to make the readme better. And if it makes you feel better, I've spent many hours trying to get Rcpp code to compile properly.
Cheers, Blair
On Thu, Mar 25, 2021 at 10:40 PM rockroland wrote:
get_data_counts.zip https://github.com/blairfix/read_ngram/files/6208980/get_data_counts.zip
First of all, thanks for writing this code and sharing it. Just like you, I was initially flummoxed as to how to efficiently deal with the new ngram format. It seemed like what you created would do the trick but I have spent half a day trying to make it work.
First, you should know that I do not know C++ at all and rarely use R except for occasions like this when I need to get to data that seems to only be accessible using R. Nevertheless, I am still fairly technically fluent so I thought it worth trying.
I installed R (x64)
I installed RStudio
I removed my old Git for Windows installation
I installed the Rcpp package
I installed the Rtools package
I put your .cpp files in my R working directory
I then tried to execute sourceCpp("get_data.cpp") at the prompt and had errors like: unexpected symbol using namespace Rcpp
I then loaded the Rcpp library at the R console prompt with library(Rcpp).
I tried again but got an error like: No such file or directory when looking for .hpp files like string.hpp
So the problem is that your readme does not make clear that you need Boost in order to successfully execute sourceCpp("get_ngrams.cpp") and sourceCpp("get_data.cpp").
Of course, it is pretty clear that you need R and Rcpp, but it took me a while to realize that I had to install BH (Boost) and make sure that my path environment was correct.
I also had to load both libraries, Rcpp and BH, at the R console command prompt with library(Rcpp) and library(BH) before executing sourceCpp("get_data.cpp").
I also had to modify get_data.cpp by adding this line at the beginning of the .cpp file: // [[Rcpp::depends(BH)]]
I only came to these realizations after tracking down the errors when I attempted sourceCpp("get_data.cpp").
There were many trial-and-error intermediate steps I took that I am not mentioning here, because the point is just that Boost needs to be properly installed and referenced.
So, in a nutshell, it may make it easier for other less experienced people (like me!) if you were explicit that these things need to be in place ahead of time.
Thanks again for making your solution available to everyone.
Best Regards