OscarKjell / text

Using Transformers from HuggingFace in R
https://r-text.org

ClassifiedText returns an error #84

Closed by massimoaria 3 weeks ago

massimoaria commented 1 year ago

When I run a classification model on this text (the abstract of a scientific paper), textClassify() returns an error. With other abstracts, the function works as expected.

Here is a reprex of the issue:

library(text)
#> This is text (version 1.1.1).
#> Text is new and still rapidly improving.
#>                
#> Newer versions may have improved functions and updated defaults to reflect current understandings of the state-of-the-art.
#>                Please send us feedback based on your experience.
#> 
#> Please note that defaults has changed in the textEmbed-functions since last version; see help(textEmbed) or www.r-text.org for more details.

x <- "OBJECTIVE TO ASSESS THE OVERALL TRENDS IN THE DEVELOPMENT AND CITATION IMPACT OF HIGH-IMPACT PAPERS IN NURSING RESEARCH WORLDWIDE TO GAIN INSIGHT INTO THE FOCUS AREAS OF NURSING RESEARCH. BACKGROUND BIBLIOMETRIC METHOD IS PROVED TO BE EFFECTIVE IN ANALYSING THE PAPERS' CHARACTERISTICS, AND IT GAINED CONSIDERABLE INTEREST FROM THE SCIENTIFIC COMMUNITY IN RECENT YEARS. AN ANALYSIS OF THE CHARACTERISTICS AND INTRINSIC PATTERNS OF HIGH-IMPACT PAPERS IN NURSING RESEARCH WILL PROVIDE AN OBJECTIVE REFLECTION OF THE RESEARCH HOT SPOTS. NURSING MANAGERS CAN POINTEDLY INCREASE FUNDING AMOUNT AND STRENGTHEN RESEARCH COOPERATION IN ORDER TO PUT THE SCIENTIFIC RESULTS INTO MANAGEMENT PRACTICE. METHODS BIBLIOMETRIC METHODS AND VISUALIZATION SOFTWARE WERE USED TO COMPREHENSIVELY ANALYSE HIGH-IMPACT PAPERS IN NURSING RESEARCH IN TERMS OF DEVELOPMENT TRENDS, COUNTRIES/REGIONS, DISTRIBUTION OF SUBJECT AREAS, RESEARCH INSTITUTES, COLLABORATIVE NETWORKS AND SUBJECT TERMS. RESULTS THERE WERE 6,886 PAPERS BETWEEN 2008 AND 2018. THE NUMBER OF PAPERS INCREASED FROM 528 IN 2008 TO 723 IN 2015, AND THEN REMAINED ABOVE 600 IN 2016 AND 2017. THESE PAPERS WERE MAINLY DISTRIBUTED IN NURSING, ONCOLOGY, PAEDIATRICS, GYNAECOLOGY, TEACHING AND EDUCATION, AND CARDIAC AND CARDIOVASCULAR SYSTEMS AND WERE CITED BY 128,845 PAPERS THAT CAME FROM 89 WEB OF SCIENCE SUBJECT AREAS. PAPERS IN NURSING RESEARCH ACCOUNTED FOR THE LARGEST SHARE OF THESE CITATIONS. THE TOP FIVE COUNTRIES IN THE WORLD IN TERMS OF THE NUMBER OF HIGH-IMPACT PAPERS WERE THE UNITED STATES, AUSTRALIA, THE UNITED KINGDOM, CANADA AND SWEDEN. THE RESEARCH INSTITUTIONS WITH THE HIGHEST NUMBER OF HIGH-IMPACT PAPERS WORLDWIDE WERE THE UNIVERSITY OF CALIFORNIA SYSTEM, THE UNIVERSITY OF PENNSYLVANIA, THE UNIVERSITY OF NORTH CAROLINA, THE UNIVERSITY OF LONDON AND THE UNIVERSITY OF TECHNOLOGY SYDNEY. 
IN THIS DATA SET, IT WAS SHOWN THAT RESEARCH COLLABORATIVE CIRCLES HAVE BEEN FORMED IN THE UNITED STATES, AUSTRALIA, CANADA AND EUROPE; THE SUBJECT-TERM ANALYSIS INDICATED THAT 'WOMEN' AND 'STUDENTS' HAVE ALWAYS BEEN HIGH-INTEREST POPULATIONS FOR HIGH-IMPACT PAPERS AND THAT CANCER IS STILL ONE OF THE GREATEST THREATS TO HUMAN HEALTH. FURTHERMORE, THE SUBJECT TERMS OF HIGH-IMPACT PAPERS IN NURSING RESEARCH HAVE GRADUALLY EVOLVED FROM 'DISEASE' AND 'THERAPY' TO 'SYMPTOMS'. CONCLUSION IN RECENT YEARS, THE NUMBER OF HIGH-IMPACT PAPERS PUBLISHED EACH YEAR IN NURSING RESEARCH HAS GROWN OVER TIME. NURSING HAS BEEN SHOWN TO BE A HIGHLY SPECIALIZED SUBJECT, AND THE MAJORITY OF ITS HIGH-IMPACT PAPERS HAVE BEEN PUBLISHED BY RESEARCH INSTITUTIONS. ALTHOUGH CROSS-REGIONAL COLLABORATIONS ARE BEGINNING TO EMERGE, THERE IS MUCH ROOM FOR IMPROVEMENT IN THIS REGARD. FINALLY, WOMEN, STUDENTS, CANCER AND SYMPTOMATIC CARE ARE THE CURRENT FOCUS AREAS IN NURSING RESEARCH. IMPLICATIONS FOR NURSING MANAGEMENT THIS STUDY INFORMS NURSING MANAGERS WITHIN THE NURSING RESEARCH FIELD ABOUT SUBJECT AREAS, COLLABORATIVE NETWORKS AND HOT TOPICS. IT IS BENEFICIAL TO PAY ATTENTION TO STUDIES, MANAGE SCIENTIFIC OUTPUTS, ALLOCATE RESOURCES, SEEK COOPERATION AND IMPROVE THE WORK EFFICIENCY OF SCIENTIFIC RESEARCH MANAGEMENT."

classifiedText <- textClassify(
  x,
  model = "distilbert-base-uncased-finetuned-sst-2-english",
  device = "cpu",
  tokenizer_parallelism = TRUE,
  logging_level = "error",
  return_incorrect_results = FALSE,
  return_all_scores = FALSE,
  function_to_apply = "none",
  set_seed = 202208
)

Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  RuntimeError: The size of tensor a (582) must match the size of tensor b (512) at non-singleton dimension 1
Run `reticulate::py_last_error()` for details.

Created on 2023-11-02 with reprex v2.0.2

massimoaria commented 1 year ago

The issue also appears with other models, e.g. "SamLowe/roberta-base-go_emotions", "nlptown/bert-base-multilingual-uncased-sentiment", and "roberta-large-mnli".

CarlViggo commented 1 year ago

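Models like BERT and RoBERTa accept at most 512 tokens per input sequence. Your abstract is tokenized into 582 tokens, which exceeds that limit and produces the tensor size mismatch in the error message. Check out this thread for more info: https://discuss.huggingface.co/t/longformer-and-sentiment-analysis/9416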
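
One possible workaround is to split a long text into shorter pieces before classifying it, so that each piece stays under the 512-token limit. The sketch below uses base R only. `split_into_chunks()` is a hypothetical helper, not part of the text package, and it counts whitespace-separated words rather than model tokens. Since subword tokenizers usually produce more tokens than words, the default budget of 200 words is an assumed conservative value, not a guarantee.

```r
# Hypothetical helper (not part of the text package): split a string into
# pieces of at most `max_words` whitespace-separated words.
# Word counts only approximate token counts, so keep the budget well below 512.
split_into_chunks <- function(x, max_words = 200) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  if (length(words) == 0 || identical(words, "")) {
    return(character(0))
  }
  # Assign each word to a group: words 1..max_words go to group 1, and so on.
  groups <- ceiling(seq_along(words) / max_words)
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

# Small stand-in for a long abstract: 450 words.
example_text <- paste(rep("word", 450), collapse = " ")
chunks <- split_into_chunks(example_text, max_words = 200)

# Each chunk could then be classified separately, for example (untested here):
# results <- lapply(chunks, function(chunk) {
#   text::textClassify(
#     chunk,
#     model = "distilbert-base-uncased-finetuned-sst-2-english"
#   )
# })
```

How to combine the per-chunk results (for example, a majority vote or an average of scores) depends on the task and is left open here. Splitting on word boundaries can also cut a sentence in half, so splitting on sentence boundaries may give cleaner inputs.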