EmilHvitfeldt / textdata

Download, parse, store, and load text datasets instead of storing them in packages
https://emilhvitfeldt.github.io/textdata/

lexicon_nrc_vad() is currently malformatted #56

Closed. sjentsch closed this issue 4 months ago

sjentsch commented 5 months ago

The original data file doesn't seem to contain a header row, so the first data row ends up as the column names:

textdata::lexicon_nrc_vad()
# A tibble: 19,970 × 4
   aaaaaaah    `0.479` `0.606` `0.291`
   <chr>         <dbl>   <dbl>   <dbl>
 1 aaaah         0.52    0.636   0.282
 2 aardvark      0.427   0.49    0.437
 3 aback         0.385   0.407   0.288
 4 abacus        0.51    0.276   0.485
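
The symptom can be reproduced without the package. This is a minimal sketch, assuming the lexicon file is tab-separated and has no header row. `raw` is a hypothetical stand-in built from the rows shown above, and `I()` marks it as literal data (this needs readr 2.0 or later).

```r
library(readr)

# Hypothetical stand-in for the headerless lexicon file (assumption: tab-separated).
raw <- paste(
  "aaaaaaah\t0.479\t0.606\t0.291",
  "aaaah\t0.52\t0.636\t0.282",
  "aardvark\t0.427\t0.49\t0.437",
  sep = "\n"
)

# With the default col_names = TRUE, read_tsv() treats the first row as the header,
# so the first word and its three scores become the column names.
misread <- read_tsv(I(raw), show_col_types = FALSE)

names(misread)
nrow(misread)
```

Here the three input rows come back as two data rows, with the first row used up as column names, which matches the tibble printed above.
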

EmilHvitfeldt commented 4 months ago

Hello @sjentsch 👋

Could you delete the dataset and try again? I'm not able to reproduce your results.

textdata::lexicon_nrc_vad(delete = TRUE)

sjentsch commented 4 months ago

Yes, I did. The result is the same. I also tried on two different machines.

sjentsch commented 4 months ago

It also happens independently of R version and OS: I tried Linux with R 4.4 (the two machines mentioned above) and a Windows machine with R 4.3.

sjentsch commented 4 months ago

The version currently on CRAN produces the error. There, textdata:::process_nrc_vad looks like this:

function (folder_path, name_path) 
{
    data <- read_tsv(path(folder_path, "NRC-VAD-Lexicon-Aug2018Release/NRC-VAD-Lexicon.txt"), 
        col_types = cols(Word = col_character(), Valence = col_double(), 
            Arousal = col_double(), Dominance = col_double()))
    write_rds(data, name_path)
}
<bytecode: 0x5a2c82c537a0>
<environment: namespace:textdata>

The version on GitHub has the error fixed. It was changed with this commit: https://github.com/EmilHvitfeldt/textdata/commit/c1f0726e79b1a4896888249713a3396c3cd699d1

But the version on CRAN apparently hasn't been updated yet.
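
For illustration, one way to read a headerless file correctly is to pass the column names explicitly. This is only a sketch under the same assumptions as the issue report (tab-separated, no header row) and is not necessarily what the linked commit does. `raw` is a hypothetical stand-in for the file contents.

```r
library(readr)

# Hypothetical stand-in for the headerless lexicon file (assumption: tab-separated).
raw <- paste(
  "aaaaaaah\t0.479\t0.606\t0.291",
  "aaaah\t0.52\t0.636\t0.282",
  "aardvark\t0.427\t0.49\t0.437",
  sep = "\n"
)

# Supplying col_names tells read_tsv() that the first row is data, not a header.
# The named col_types then match the supplied names.
fixed <- read_tsv(
  I(raw),
  col_names = c("Word", "Valence", "Arousal", "Dominance"),
  col_types = cols(
    Word = col_character(),
    Valence = col_double(),
    Arousal = col_double(),
    Dominance = col_double()
  )
)

names(fixed)
nrow(fixed)
```

With the names supplied, all three input rows are kept as data and the columns are called Word, Valence, Arousal, and Dominance.
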

EmilHvitfeldt commented 4 months ago

The package has been updated on CRAN!

sjentsch commented 4 months ago

Thanks, Emil!