OscarKjell / text

Using Transformers from HuggingFace in R
https://r-text.org

Unable to use models that require approval on HuggingFace #146

Open MattCowgill opened 6 months ago

MattCowgill commented 6 months ago

Some models (e.g. Google's Gemma family) require authorisation before use. This authorisation is free, immediate, and automatic, but must be completed by each user. A similar process is in place for Meta's Llama 2 model.

It currently does not appear possible to use these models with {text}, as there is no way to confirm that a user has been approved to use the model. Is there some workaround for this?
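
One workaround I considered is authenticating with the Python huggingface_hub library inside the environment that {text} uses, in the hope that the cached credentials get picked up (an untested sketch; it assumes huggingface_hub is installed in that environment):

hf_hub <- reticulate::import("huggingface_hub")
hf_hub$login(token = "hf_...")  # token generated in the huggingface.co settings

I don't know whether {text} would respect this, though. Reprex below: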

library(text)
#> This is text (version 1.1.1).
#> Text is new and still rapidly improving.
#>                
#> Newer versions may have improved functions and updated defaults to reflect current understandings of the state-of-the-art.
#>                Please send us feedback based on your experience.
#> 
#> Please note that defaults has changed in the textEmbed-functions since last version; see help(textEmbed) or www.r-text.org for more details.

textrpp_initialize()
#> 
#> Successfully initialized text required python packages.
#> 
#> Python options: 
#>  type = "textrpp_condaenv", 
#>  name = "textrpp_condaenv".

some_sentences <- Language_based_assessment_data_8$satisfactiontexts |>  
  head(5) 

# Works
bert_emb <- textEmbed(some_sentences,
                      model = "bert-base-uncased")
#> Completed layers output for texts (variable: 1/1, duration: 7.845629 secs).
#> Completed layers aggregation for word_type_embeddings. 
#> Completed layers aggregation (variable 1/1, duration: 3.445685 secs).
#> Completed layers aggregation (variable 1/1, duration: 3.456672 secs).
#> 

# Fails
gemma_emb <- textEmbed(some_sentences,
                       model = "google/gemma-2b")
#> We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like google/gemma-2b is not the path to a directory containing a config.json file.
#> Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

# Also fails
gemma_emb <- textEmbed(some_sentences,
                       model = "gemma-2b")
#> We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like gemma-2b is not the path to a directory containing a config.json file.
#> Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Created on 2024-02-22 with reprex v2.0.2

moomoofarm1 commented 6 months ago

Hi, the text package has been updated. Now you can try it with one of the Llama models. Please use the code below:

embed1 <- textEmbed(
  "Hello!",
  model = "meta-llama/Llama-2-7b-chat-hf",
  hg_gated = TRUE,
  hg_token = "hf_..."
)

Because the model is gated, before trying it, use your Hugging Face account to generate a token (in the settings menu in the upper right). Remember that you need the model owner's permission to use the model.

MattCowgill commented 6 months ago

That's great, thanks so much @moomoofarm1! Unfortunately it does not work for me with either google/gemma-2b or meta-llama/Llama-2-7b, even though I have been authorised to use both models.

The output indicates that the token is valid and that I have successfully logged in to Hugging Face, but the package still appears unable to load the models. Please see the example below; I have redacted my token.

library(text)
#> This is text (version 1.2.0.7).
#> Text is new and still rapidly improving.
#>                
#> Newer versions may have improved functions and updated defaults to reflect current understandings of the state-of-the-art.
#>                Please send us feedback based on your experience.
#> 
#> Please note that defaults has changed in the textEmbed-functions since last version; see help(textEmbed) or www.r-text.org for more details.

textrpp_initialize()
#> 
#> Successfully initialized text required python packages.
#> 
#> Python options: 
#>  type = "textrpp_condaenv", 
#>  name = "textrpp_condaenv".

sentences <- Language_based_assessment_data_8$satisfactiontexts |>  
  head(5) 

# Works (bert-base-uncased)
textEmbed(sentences)
#> Successfully logged out.
#> Successfully logout to Huggingface!
#> Completed layers output for texts (variable: 1/1, duration: 5.881148 secs).
#> Completed layers aggregation for word_type_embeddings. 
#> Completed layers aggregation (variable 1/1, duration: 2.115880 secs).
#> Completed layers aggregation (variable 1/1, duration: 2.194111 secs).
#> 
#> $tokens
#> $tokens$texts
#> $tokens$texts[[1]]
#> # A tibble: 79 × 769
#>    tokens       Dim1    Dim2    Dim3     Dim4    Dim5    Dim6     Dim7    Dim8
#>    <chr>       <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
#>  1 [CLS]      0.298   0.388  -0.144  -0.817   -0.739  -0.393   0.202    0.600 
#>  2 i          0.381   0.735   0.197  -0.429   -0.714   0.629   0.0953   0.988 
#>  3 am         0.239   0.617   0.0899 -0.777   -0.181   0.454  -0.385    0.945 
#>  4 not       -0.106   0.0888  0.315  -0.513    0.386   0.243  -0.701    1.45  
#>  5 satisfied  0.701  -0.162  -0.0111 -0.218   -0.0840 -0.178  -0.373    0.293 
#>  6 with       0.580   0.0611  0.160  -0.411   -0.185  -0.769  -0.151    0.959 
#>  7 my         0.720   0.742   0.0635 -0.513   -0.830  -0.201   0.720    0.809 
#>  8 life       1.85    0.177   0.261  -0.734    0.942  -0.0581  0.783    0.836 
#>  9 .          0.0569  0.0169 -0.0264 -0.00590 -0.0697 -0.0454 -0.0259  -0.0474
#> 10 [SEP]      0.0298  0.0295  0.0103  0.0277  -0.0503 -0.0635  0.00825 -0.0206
#> # ℹ 69 more rows
#> # ℹ 760 more variables: Dim9 <dbl>, Dim10 <dbl>, Dim11 <dbl>, Dim12 <dbl>,
#> #   Dim13 <dbl>, Dim14 <dbl>, Dim15 <dbl>, Dim16 <dbl>, Dim17 <dbl>,
#> #   Dim18 <dbl>, Dim19 <dbl>, Dim20 <dbl>, Dim21 <dbl>, Dim22 <dbl>,
#> #   Dim23 <dbl>, Dim24 <dbl>, Dim25 <dbl>, Dim26 <dbl>, Dim27 <dbl>,
#> #   Dim28 <dbl>, Dim29 <dbl>, Dim30 <dbl>, Dim31 <dbl>, Dim32 <dbl>,
#> #   Dim33 <dbl>, Dim34 <dbl>, Dim35 <dbl>, Dim36 <dbl>, Dim37 <dbl>, …
#> 
#> $tokens$texts[[2]]
#> # A tibble: 95 × 769
#>    tokens        Dim1   Dim2    Dim3     Dim4    Dim5    Dim6    Dim7    Dim8
#>    <chr>        <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 [CLS]       0.360  0.0597 -0.277  -0.278   -0.575  -0.264  -0.0316  0.218 
#>  2 i          -0.156  0.225   0.0913  0.546   -0.609   0.888   0.368   0.405 
#>  3 am         -0.434  0.344  -0.0923  0.0631   0.132   0.638   0.228   0.643 
#>  4 definitely  0.812  0.190   0.869   0.00139  0.926   0.434   0.434  -0.0868
#>  5 pretty     -0.607  0.132   0.851  -0.146   -0.526   0.0246 -0.215   0.410 
#>  6 satisfied   0.568  0.285   0.259  -0.207    0.189  -0.227  -0.0341 -0.636 
#>  7 right      -1.22   0.429   0.718   0.416   -0.927   0.0383 -1.37    0.368 
#>  8 now        -0.799  0.0608 -0.132  -0.0463  -0.466   0.538  -1.07    0.498 
#>  9 .           0.0403 0.0220 -0.0171  0.0179  -0.0433 -0.0560 -0.0217 -0.0530
#> 10 [SEP]      -0.107  0.159   0.147   0.263   -0.0545  0.0586  0.251  -0.0749
#> # ℹ 85 more rows
#> # ℹ 760 more variables: Dim9 <dbl>, Dim10 <dbl>, Dim11 <dbl>, Dim12 <dbl>,
#> #   Dim13 <dbl>, Dim14 <dbl>, Dim15 <dbl>, Dim16 <dbl>, Dim17 <dbl>,
#> #   Dim18 <dbl>, Dim19 <dbl>, Dim20 <dbl>, Dim21 <dbl>, Dim22 <dbl>,
#> #   Dim23 <dbl>, Dim24 <dbl>, Dim25 <dbl>, Dim26 <dbl>, Dim27 <dbl>,
#> #   Dim28 <dbl>, Dim29 <dbl>, Dim30 <dbl>, Dim31 <dbl>, Dim32 <dbl>,
#> #   Dim33 <dbl>, Dim34 <dbl>, Dim35 <dbl>, Dim36 <dbl>, Dim37 <dbl>, …
#> 
#> $tokens$texts[[3]]
#> # A tibble: 36 × 769
#>    tokens      Dim1    Dim2    Dim3     Dim4     Dim5    Dim6     Dim7     Dim8
#>    <chr>      <dbl>   <dbl>   <dbl>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
#>  1 [CLS]     0.250   0.141   0.110  -0.667   -0.370   -0.234   0.0473   0.588  
#>  2 i         0.230   0.233   0.683  -0.450   -0.264    1.03    0.132    0.999  
#>  3 am        0.262   0.327   0.364  -0.720    0.603    0.850  -0.622    0.772  
#>  4 very      0.379  -0.473   1.24   -0.258    0.766    0.392   0.0325   0.696  
#>  5 much      0.343  -0.514   0.992  -0.742    0.397    0.584  -0.668    0.225  
#>  6 satisfied 1.18    0.456   0.322  -0.499    1.06    -0.157  -0.309   -0.189  
#>  7 .         0.0446  0.0207 -0.0248  0.00119 -0.0358  -0.0437 -0.0176  -0.0554 
#>  8 [SEP]     0.0376 -0.0132  0.0790  0.0160  -0.00191 -0.0223  0.00496 -0.00711
#>  9 [CLS]     0.558   0.388  -0.0219 -0.342   -0.652   -0.621   0.140    0.998  
#> 10 i         0.981   0.286   0.301  -0.350   -0.516    0.269   0.409    1.27   
#> # ℹ 26 more rows
#> # ℹ 760 more variables: Dim9 <dbl>, Dim10 <dbl>, Dim11 <dbl>, Dim12 <dbl>,
#> #   Dim13 <dbl>, Dim14 <dbl>, Dim15 <dbl>, Dim16 <dbl>, Dim17 <dbl>,
#> #   Dim18 <dbl>, Dim19 <dbl>, Dim20 <dbl>, Dim21 <dbl>, Dim22 <dbl>,
#> #   Dim23 <dbl>, Dim24 <dbl>, Dim25 <dbl>, Dim26 <dbl>, Dim27 <dbl>,
#> #   Dim28 <dbl>, Dim29 <dbl>, Dim30 <dbl>, Dim31 <dbl>, Dim32 <dbl>,
#> #   Dim33 <dbl>, Dim34 <dbl>, Dim35 <dbl>, Dim36 <dbl>, Dim37 <dbl>, …
#> 
#> $tokens$texts[[4]]
#> # A tibble: 66 × 769
#>    tokens     Dim1    Dim2     Dim3    Dim4    Dim5     Dim6    Dim7    Dim8
#>    <chr>     <dbl>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
#>  1 [CLS]  -0.342    0.287  -0.377   -0.254  -0.496  -0.826   -0.146   0.627 
#>  2 i       0.0336   0.459  -0.558    0.310   0.0898  0.145   -0.370   0.720 
#>  3 feel   -0.324    0.937   0.250    0.302   0.272   0.00828 -0.587  -0.0795
#>  4 lost   -0.303    0.490   1.10    -0.322   1.08   -0.827   -1.11    0.988 
#>  5 .      -0.262   -0.297  -0.663    0.308  -0.249  -0.278   -0.505   0.545 
#>  6 [SEP]   0.00297  0.0184 -0.0196   0.0217 -0.0391 -0.103   -0.0172 -0.0246
#>  7 [CLS]   0.307    0.297  -0.249   -0.696  -0.605  -0.536   -0.100   0.126 
#>  8 i       0.276    0.550   0.0344   0.526   0.111   0.826   -0.180   0.366 
#>  9 don     0.581    1.38    0.140    0.0684  0.342   0.0500  -0.668   0.880 
#> 10 '       0.0537   0.0196 -0.00823  0.0235 -0.0361 -0.0607  -0.0188 -0.0431
#> # ℹ 56 more rows
#> # ℹ 760 more variables: Dim9 <dbl>, Dim10 <dbl>, Dim11 <dbl>, Dim12 <dbl>,
#> #   Dim13 <dbl>, Dim14 <dbl>, Dim15 <dbl>, Dim16 <dbl>, Dim17 <dbl>,
#> #   Dim18 <dbl>, Dim19 <dbl>, Dim20 <dbl>, Dim21 <dbl>, Dim22 <dbl>,
#> #   Dim23 <dbl>, Dim24 <dbl>, Dim25 <dbl>, Dim26 <dbl>, Dim27 <dbl>,
#> #   Dim28 <dbl>, Dim29 <dbl>, Dim30 <dbl>, Dim31 <dbl>, Dim32 <dbl>,
#> #   Dim33 <dbl>, Dim34 <dbl>, Dim35 <dbl>, Dim36 <dbl>, Dim37 <dbl>, …
#> 
#> $tokens$texts[[5]]
#> # A tibble: 110 × 769
#>    tokens       Dim1     Dim2     Dim3     Dim4    Dim5    Dim6    Dim7    Dim8
#>    <chr>       <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 [CLS]      0.238  -0.207   -0.135   -0.485   -0.473  -0.251  -0.0176  0.806 
#>  2 in        -0.0221  0.443    0.336   -0.557   -0.515   0.361  -0.311   1.21  
#>  3 general    0.340   0.811    0.00655  0.0631  -1.01   -0.273  -0.448   0.677 
#>  4 i          0.417  -0.0146   0.289   -0.169   -0.324   0.643   0.0900  1.57  
#>  5 am         0.269   0.141    0.163   -0.581    0.109   0.657  -0.595   1.47  
#>  6 very       0.0738  0.0154   0.755   -0.516   -0.328   0.311  -0.793   1.24  
#>  7 satisfied  1.07    0.513   -0.0973  -0.278    0.721  -0.0259 -0.259  -0.147 
#>  8 .          0.0573  0.0174  -0.0303   0.00726 -0.0422 -0.0397 -0.0316 -0.0541
#>  9 [SEP]      0.0616 -0.00605  0.0133   0.0241  -0.0346 -0.0351 -0.0227 -0.0158
#> 10 [CLS]      0.429  -0.375   -0.0654  -0.0470  -0.148  -0.232  -0.0969  0.390 
#> # ℹ 100 more rows
#> # ℹ 760 more variables: Dim9 <dbl>, Dim10 <dbl>, Dim11 <dbl>, Dim12 <dbl>,
#> #   Dim13 <dbl>, Dim14 <dbl>, Dim15 <dbl>, Dim16 <dbl>, Dim17 <dbl>,
#> #   Dim18 <dbl>, Dim19 <dbl>, Dim20 <dbl>, Dim21 <dbl>, Dim22 <dbl>,
#> #   Dim23 <dbl>, Dim24 <dbl>, Dim25 <dbl>, Dim26 <dbl>, Dim27 <dbl>,
#> #   Dim28 <dbl>, Dim29 <dbl>, Dim30 <dbl>, Dim31 <dbl>, Dim32 <dbl>,
#> #   Dim33 <dbl>, Dim34 <dbl>, Dim35 <dbl>, Dim36 <dbl>, Dim37 <dbl>, …
#> 
#> 
#> 
#> $texts
#> $texts$texts
#> # A tibble: 5 × 768
#>   Dim1_texts Dim2_texts Dim3_texts Dim4_texts Dim5_texts Dim6_texts Dim7_texts
#>        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1      0.340      0.224      0.149     -0.235     0.0894     -0.203     0.0431
#> 2      0.153      0.403      0.273     -0.281    -0.0164     -0.115     0.0294
#> 3      0.524      0.199      0.292     -0.304     0.305      -0.302     0.0974
#> 4      0.258      0.319     -0.123     -0.189     0.191      -0.232     0.0610
#> 5      0.316      0.326      0.351     -0.151     0.0276     -0.121     0.0231
#> # ℹ 761 more variables: Dim8_texts <dbl>, Dim9_texts <dbl>, Dim10_texts <dbl>,
#> #   Dim11_texts <dbl>, Dim12_texts <dbl>, Dim13_texts <dbl>, Dim14_texts <dbl>,
#> #   Dim15_texts <dbl>, Dim16_texts <dbl>, Dim17_texts <dbl>, Dim18_texts <dbl>,
#> #   Dim19_texts <dbl>, Dim20_texts <dbl>, Dim21_texts <dbl>, Dim22_texts <dbl>,
#> #   Dim23_texts <dbl>, Dim24_texts <dbl>, Dim25_texts <dbl>, Dim26_texts <dbl>,
#> #   Dim27_texts <dbl>, Dim28_texts <dbl>, Dim29_texts <dbl>, Dim30_texts <dbl>,
#> #   Dim31_texts <dbl>, Dim32_texts <dbl>, Dim33_texts <dbl>, …

# Fails (no token)
textEmbed(sentences,
          model = "google/gemma-2b")
#> We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like google/gemma-2b is not the path to a directory containing a config.json file.
#> Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

# Fails (token is valid, cannot find model)
textEmbed(sentences,
          model = "google/gemma-2b",
          hg_gated = TRUE,
          hg_token = "REDACTED_TOKEN")
#> Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
#> Token is valid (permission: read).
#> Your token has been saved to /Users/mcowgill/.cache/huggingface/token
#> Login successful
#> Successfully login to Huggingface!
#> We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like google/gemma-2b is not the path to a directory containing a config.json file.
#> Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

# Fails (token is valid, cannot find model)
textEmbed(sentences,
          model = "meta-llama/Llama-2-7b",
          hg_gated = TRUE,
          hg_token = "REDACTED_TOKEN")
#> Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
#> Token is valid (permission: read).
#> Your token has been saved to /Users/mcowgill/.cache/huggingface/token
#> Login successful
#> Successfully login to Huggingface!
#> We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like meta-llama/Llama-2-7b is not the path to a directory containing a config.json file.
#> Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Created on 2024-02-28 with reprex v2.0.2
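
For what it's worth, one way to check whether the token really grants access to a gated repo, independently of {text}, might be to query the Hub from the same Python environment (an untested sketch; it assumes huggingface_hub is importable there):

hf_hub <- reticulate::import("huggingface_hub")
# model_info() should succeed if the token grants access to the gated
# repo, and raise an error otherwise.
info <- hf_hub$model_info("google/gemma-2b", token = "REDACTED_TOKEN")
info$id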

MattCowgill commented 6 months ago

Separately, I believe it would be useful to allow users to set an environment variable, HUGGINGFACE_TOKEN, rather than requiring them to enter the token each time they invoke textEmbed() or similar functions. I will put through a PR that makes this change.
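
A minimal sketch of how this could look from the user's side (HUGGINGFACE_TOKEN is the variable name proposed above; the fallback logic itself would live in the package):

# Set once, e.g. in ~/.Renviron:
#   HUGGINGFACE_TOKEN=hf_...
# The token then never needs to appear in scripts:
textEmbed(
  sentences,
  model = "meta-llama/Llama-2-7b",
  hg_gated = TRUE,
  hg_token = Sys.getenv("HUGGINGFACE_TOKEN")
)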

MattCowgill commented 6 months ago

The PR linked above adds the environment variable functionality.

Please note it does not address the substantive issue, namely that accessing gated models is not working for me.

moomoofarm1 commented 6 months ago

Gemma needs newer Python support. I think this model family will become available after an update to the Hugging Face transformers package, and that should happen soon, since the newest transformers release has just come out.

As for Llama-2-7b, we are still looking into the cause. Stay tuned :)
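
In the meantime, if you want to experiment, you could try upgrading transformers inside the conda environment that text uses (a sketch, not an official workflow; the environment name textrpp_condaenv is taken from the textrpp_initialize() output above):

reticulate::conda_install(
  envname = "textrpp_condaenv",
  packages = "transformers",
  pip = TRUE,
  pip_options = "--upgrade"
)

After upgrading, restart R and run textrpp_initialize() again before calling textEmbed().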

MattCowgill commented 6 months ago

Thanks @moomoofarm1, I look forward to it.