allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0
2.05k stars 276 forks source link

led-base-16384 summary results #154

Open mmoya01 opened 3 years ago

mmoya01 commented 3 years ago

Hello, I’m trying to use LED for long document summarization. That said, I tried: pip installing the Longformer repo, following LED’s summarization.py 1 and downloading the led-base-16384 pretrained model. I noticed that it referenced arxiv articles so my input_data below is one of those articles. However, after running the snippet below:

import logging
import json
import os
import pandas as pd
import torch
from longformer import LongformerEncoderDecoderConfig,LongformerEncoderDecoderForConditionalGeneration
from longformer.sliding_chunks import pad_to_window_size
from transformers import LongformerTokenizer

logger = logging.getLogger(__name__)

ATTENTION_WINDOW = 512
ATTENTION_MODE = 'sliding_chunks'
DROPOUT = 0.1
SMOOTHING = 0.0
MODEL_PATH = 'longformer-encdec-base-16384/'

def _prepare_input(input_ids):
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
    attention_mask[input_ids == tokenizer.pad_token_id] = 0
    if isinstance(model, LongformerEncoderDecoderForConditionalGeneration):
        attention_mask[:, 0] = 2  # global attention on one token for all model params to be used, which is important for gradient checkpointing to work
        if config.attention_mode == 'sliding_chunks':
            half_padding_mod = model.config.attention_window[0]
        elif config.attention_mode == 'sliding_chunks_no_overlap':
            half_padding_mod = model.config.attention_window[0] / 2
        else:
            raise NotImplementedError
        input_ids, attention_mask = pad_to_window_size(  # ideally, should be moved inside the LongformerModel
            input_ids, attention_mask, half_padding_mod, tokenizer.pad_token_id)
    return input_ids, attention_mask        

if __name__ == "__main__":
    device = torch.device('cpu') 
    logger.info('load data')
    input_data = """
    'additive models @xcite provide an important family of models for semiparametric regression or classification . some reasons for the success of additive models are their increased flexibility when compared to linear or generalized linear models and their increased interpretability when compared to fully nonparametric models .it is well - known that good estimators in additive models are in general less prone to the curse of high dimensionality than good estimators in fully nonparametric models .many examples of such estimators belong to the large class of regularized kernel based methods over a reproducing kernel hilbert space @xmath0 , see e.g. @xcite . in the last yearsmany interesting results on learning rates of regularized kernel based models for additive models have been published when the focus is on sparsity and when the classical least squares loss function is used , see e.g. @xcite , @xcite , @xcite , @xcite , @xcite , @xcite and the references therein . of course , the least squares loss function is differentiable and has many nice mathematical properties , but it is only locally lipschitz continuous and therefore regularized kernel based methods based on this loss function typically suffer on bad statistical robustness properties , even if the kernel is bounded .this is in sharp contrast to kernel methods based on a lipschitz continuous loss function and on a bounded loss function , where results on upper bounds for the maxbias bias and on a bounded influence function are known , see e.g. @xcite for the general case and @xcite for additive models .therefore , we will here consider the case of regularized kernel based methods based on a general convex and lipschitz continuous loss function , on a general kernel , and on the classical regularizing term @xmath1 for some @xmath2 which is a smoothness penalty but not a sparsity penalty , see e.g. @xcite .such regularized kernel based methods are now often called support vector machines ( svms ) , although the notation was historically used for such methods based on the special hinge loss function and for special kernels only , we refer to @xcite .    in this paper we address the open question , whether an svm with an additive kernel can provide a substantially better learning rate in high dimensions than an svm with a general kernel , say a classical gaussian rbf kernel , if the assumption of an additive model is satisfied .our leading example covers learning rates for quantile regression based on the lipschitz continuous but non - differentiable pinball loss function , which is also called check function in the literature , see e.g. @xcite and @xcite for parametric quantile regression and @xcite , @xcite , and @xcite for kernel based quantile regression .we will not address the question how to check whether the assumption of an additive model is satisfied because this would be a topic of a paper of its own .of course , a practical approach might be to fit both models and compare their risks evaluated for test data .for the same reason we will also not cover sparsity .consistency of support vector machines generated by additive kernels for additive models was considered in @xcite . in this paperwe establish learning rates for these algorithms .let us recall the framework with a complete separable metric space @xmath3 as the input space and a closed subset @xmath4 of @xmath5 as the output space .a borel probability measure @xmath6 on @xmath7 is used to model the learning problem and an independent and identically distributed sample @xmath8 is drawn according to @xmath6 for learning .a loss function @xmath9 is used to measure the quality of a prediction function @xmath10 by the local error @xmath11 ._ throughout the paper we assume that @xmath12 is measurable , @xmath13 , convex with respect to the third variable , and uniformly lipschitz continuous satisfying @xmath14 with a finite constant @xmath15 ._    support vector machines ( svms ) considered here are kernel - based regularization schemes in a reproducing kernel hilbert space ( rkhs ) @xmath0 generated by a mercer kernel @xmath16 . with a shifted loss function @xmath17 introduced for dealingeven with heavy - tailed distributions as @xmath18 , they take the form @xmath19 where for a general borel measure @xmath20 on @xmath21 , the function @xmath22 is defined by @xmath23 where @xmath24 is a regularization parameter .the idea to shift a loss function has a long history , see e.g. @xcite in the context of m - estimators .it was shown in @xcite that @xmath22 is also a minimizer of the following optimization problem involving the original loss function @xmath12 if a minimizer exists : @xmath25    the additive model we consider consists of the _ input space decomposition _@xmath26 with each @xmath27 a complete separable metric space and a _ hypothesis space _@xmath28 where @xmath29 is a set of functions @xmath30 each of which is also identified as a map @xmath31 from @xmath3 to @xmath5 .hence the functions from @xmath32 take the additive form @xmath33 .we mention , that there is strictly speaking a notational problem here , because in the previous formula each quantity @xmath34 is an element of the set @xmath35 which is a subset of the full input space @xmath36 , @xmath37 , whereas in the definition of sample @xmath8 each quantity @xmath38 is an element of the full input space @xmath36 , where @xmath39 .because these notations will only be used in different places and because we do not expect any misunderstandings , we think this notation is easier and more intuitive than specifying these quantities with different symbols .the additive kernel @xmath40 is defined in terms of mercer kernels @xmath41 on @xmath27 as @xmath42 it generates an rkhs @xmath0 which can be written in terms of the rkhs @xmath43 generated by @xmath41 on @xmath27 corresponding to the form ( [ additive ] ) as @xmath44 with norm given by @xmath45 the norm of @xmath46 satisfies @xmath47    to illustrate advantages of additive models , we provide two examples of comparing additive with product kernels .the first example deals with gaussian rbf kernels .all proofs will be given in section [ proofsection ] .[ gaussadd ] let @xmath48 , @xmath49 $ ] and @xmath50 ^ 2.$ ] let @xmath51 and @xmath52.\\ ] ] the additive kernel @xmath53 is given by @xmath54 furthermore , the product kernel @xmath55 is the standard gaussian kernel given by @xmath56 define a gaussian function @xmath57 on @xmath58 ^ 2 $ ] depending only on one variable by @xmath59 then @xmath60 but @xmath61 where @xmath62 denotes the rkhs generated by the standard gaussian rbf kernel @xmath63 .the second example is about sobolev kernels .[ sobolvadd ] let @xmath64 , @xmath65 $ ] and @xmath58^s.$ ] let @xmath66 : = \\bigl\\{u\\in l_2([0,1 ] ) ; d^\\alpha u \\in l_2([0,1 ] ) \\mbox{~for~all~}|\\alpha|\\le 1\\bigr\\}\\ ] ] be the sobolev space consisting of all square integrable univariate functions whose derivative is also square integrable .it is an rkhs with a mercer kernel @xmath67 defined on @xmath68 ^ 2 $ ] .if we take all the mercer kernels @xmath69 to be @xmath67 , then @xmath70 $ ] for each @xmath71 .the additive kernel @xmath72 is also a mercer kernel and defines an rkhs @xmath73\\right\\}.\\ ] ] however , the multivariate sobolev space @xmath74^s)$ ] , consisting of all square integrable functions whose partial derivatives are all square integrable , contains discontinuous functions and is not an rkhs .denote the marginal distribution of @xmath6 on @xmath27 as @xmath75 . under the assumption that @xmath76 for each @xmath71 and that @xmath43 is dense in @xmath29 in the @xmath77-metric , it was proved in @xcite that @xmath78 in probability as long as @xmath79 satisfies @xmath80 and @xmath81 .the rest of the paper has the following structure .section [ ratessection ] contains our main results on learning rates for svms based on additive kernels . learning rates for quantile regressionare treated as important special cases .section [ comparisonsection ] contains a comparison of our results with other learning rates published recently .section [ proofsection ] contains all the proofs and some results which can be interesting in their own .in this paper we provide some learning rates for the support vector machines generated by additive kernels for additive models which helps improve the quantitative understanding presented in @xcite .the rates are about asymptotic behaviors of the excess risk @xmath82 and take the form @xmath83 with @xmath84 .they will be stated under three kinds of conditions involving the hypothesis space @xmath0 , the measure @xmath6 , the loss @xmath12 , and the choice of the regularization parameter @xmath85 .the first condition is about the approximation ability of the hypothesis space @xmath0 .since the output function @xmath19 is from the hypothesis space , the learning rates of the learning algorithm depend on the approximation ability of the hypothesis space @xmath0 with respect to the optimal risk @xmath86 measured by the following approximation error .[ defapprox ] the approximation error of the triple @xmath87 is defined as @xmath88    to estimate the approximation error , we make an assumption about the minimizer of the risk @xmath89    for each @xmath90 , define the integral operator @xmath91 associated with the kernel @xmath41 by @xmath92 we mention that @xmath93 is a compact and positive operator on @xmath94 . hence we can find its normalized eigenpairs @xmath95 such that @xmath96 is an orthonormal basis of @xmath94 and @xmath97 as @xmath98 . fix @xmath99 .then we can define the @xmath100-th power @xmath101 of @xmath93 by @xmath102 this is a positive and bounded operator and its range is well - defined .the assumption @xmath103 means @xmath104 lies in this range .[ assumption1 ] we assume @xmath105 and @xmath106 where for some @xmath107 and each @xmath108 , @xmath109 is a function of the form @xmath110 with some @xmath111 .the case @xmath112 of assumption [ assumption1 ] means each @xmath113 lies in the rkhs @xmath43 .a standard condition in the literature ( e.g. , @xcite ) for achieving decays of the form @xmath114 for the approximation error ( [ approxerrordef ] ) is @xmath115 with some @xmath116 . herethe operator @xmath117 is defined by @xmath118 in general , this can not be written in an additive form .however , the hypothesis space ( [ additive ] ) takes an additive form @xmath119 .so it is natural for us to impose an additive expression @xmath120 for the target function @xmath121 with the component functions @xmath113 satisfying the power condition @xmath110 .the above natural assumption leads to a technical difficulty in estimating the approximation error : the function @xmath113 has no direct connection to the marginal distribution @xmath122 projected onto @xmath27 , hence existing methods in the literature ( e.g. , @xcite ) can not be applied directly .note that on the product space @xmath123 , there is no natural probability measure projected from @xmath6 , and the risk on @xmath124 is not defined .    our idea to overcome the difficulty is to introduce an intermediate function @xmath125 .it may not minimize a risk ( which is not even defined ) .however , it approximates the component function @xmath113 well .when we add up such functions @xmath126 , we get a good approximation of the target function @xmath121 , and thereby a good estimate of the approximation error .this is the first novelty of the paper .[ approxerrorthm ] under assumption [ assumption1 ] , we have @xmath127 where @xmath128 is the constant given by @xmath129      the second condition for our learning rates is about the capacity of the hypothesis space measured by @xmath130-empirical covering numbers .    let @xmath131 be a set of functions on @xmath21 and @xmath132 for every @xmath133 the * covering number of @xmath131 * with respect to the empirical metric @xmath134 , given by @xmath135 is defined as @xmath136 and the * @xmath130-empirical covering number * of @xmath137 is defined as @xmath138    [ assumption2 ] we assume @xmath139 and that for some @xmath140 , @xmath141 and every @xmath142 , the @xmath130-empirical covering number of the unit ball of @xmath43 satisfies @xmath143    the second novelty of this paper is to observe that the additive nature of the hypothesis space yields the following nice bound with a dimension - independent power exponent for the covering numbers of the balls of the hypothesis space @xmath0 , to be proved in section [ samplesection ] .[ capacitythm ] under assumption [ assumption2 ] , for any @xmath144 and @xmath145 , we have @xmath146    the bound for the covering numbers stated in theorem [ capacitythm ] is special : the power @xmath147 is independent of the number @xmath148 of the components in the additive model .it is well - known @xcite in the literature of function spaces that the covering numbers of balls of the sobolev space @xmath149 on the cube @xmath150^s$ ] of the euclidean space @xmath151 with regularity index @xmath152 has the following asymptotic behavior with @xmath153 : @xmath154 here the power @xmath155 depends linearly on the dimension @xmath148 .similar dimension - dependent bounds for the covering numbers of the rkhss associated with gaussian rbf - kernels can be found in @xcite .the special bound in theorem [ capacitythm ] demonstrates an advantage of the additive model in terms of capacity of the additive hypothesis space .the third condition for our learning rates is about the noise level in the measure @xmath6 with respect to the hypothesis space . before stating the general condition, we consider a special case for quantile regression , to illustrate our general results .let @xmath156 be a quantile parameter .the quantile regression function @xmath157 is defined by its value @xmath158 to be a @xmath159-quantile of @xmath160 , i.e. , a value @xmath161 satisfying @xmath162 the regularization scheme for quantile regression considered here takes the form ( [ algor ] ) with the loss function @xmath12 given by the pinball loss as @xmath163    a noise condition on @xmath6 for quantile regression is defined in @xcite as follows . to this end , let @xmath164 be a probability measure on @xmath165 and @xmath166 . then a real number @xmath167 is called @xmath159-quantile of @xmath164 , if and only if @xmath167 belongs to the set @xmath168\\bigr ) \\ge\\tau     \\mbox{~~and~~ } q\\bigl([t , \\infty)\\bigr ) \\ge 1-\\tau\\bigr\\}\\,.\\ ] ] it is well - known that @xmath169 is a compact interval .[ noisecond ] let @xmath166 .    1 .a probability measure @xmath164 on @xmath165 is said to have a * @xmath159-quantile of type @xmath170 * , if there exist a @xmath159-quantile @xmath171 and a constant @xmath172 such that , for all @xmath173 $ ] , we have @xmath174 2 .let @xmath175 $ ] .we say that a probability measure @xmath20 on @xmath176 has a * @xmath159-quantile of @xmath177-average type @xmath170 * if the conditional probability measure @xmath178 has @xmath179-almost surely a @xmath159-quantile of type @xmath170 and the function @xmath180 where @xmath181 is the constant defined in part ( 1 ) , satisfies @xmath182 .one can show that a distribution @xmath164 having a @xmath159-quantile of type @xmath170 has a unique @xmath159-quantile @xmath183 .moreover , if @xmath164 has a lebesgue density @xmath184 then @xmath164 has a @xmath159-quantile of type @xmath170 if @xmath184 is bounded away from zero on @xmath185 $ ] since we can use @xmath186\\}$ ] in ( [ tauquantileoftype2formula ] ) .this assumption is general enough to cover many distributions used in parametric statistics such as gaussian , student s @xmath187 , and logistic distributions ( with @xmath188 ) , gamma and log - normal distributions ( with @xmath189 ) , and uniform and beta distributions ( with @xmath190 $ ] ) .the following theorem , to be proved in section [ proofsection ] , gives a learning rate for the regularization scheme ( [ algor ] ) in the special case of quantile regression .[ quantilethm ] suppose that @xmath191 almost surely for some constant @xmath192 , and that each kernel @xmath41 is @xmath193 with @xmath194 for some @xmath195 .if assumption [ assumption1 ] holds with @xmath112 and @xmath6 has a @xmath159-quantile of @xmath177-average type @xmath170 for some @xmath196 $ ] , then by taking @xmath197 , for any @xmath198 and @xmath199 , with confidence at least @xmath200 we have @xmath201 where @xmath202 is a constant independent of @xmath203 and @xmath204 and @xmath205    please note that the exponent @xmath206 given by ( [ quantilerates2 ] ) for the learning rate in ( [ quantilerates ] ) is independent of the quantile level @xmath159 , of the number @xmath148 of additive components in @xmath207 , and of the dimensions @xmath208 and @xmath209 further note that @xmath210 , if @xmath211 , and @xmath212 if @xmath213 . because @xmath214 can be arbitrarily close to @xmath215 , the learning rate , which is independent of the dimension @xmath216 and given by theorem [ quantilethm ] , is close to @xmath217 for large values of @xmath177 and is close to @xmath218 or better , if @xmath211 .      to state our general learning rates, we need an assumption on a _ variance - expectation bound _ which is similar to definition [ noisecond ] in the special case of quantile regression .[ assumption3 ] we assume that there exist an exponent @xmath219 $ ] and a positive constant @xmath220 such that @xmath221    assumption [ assumption3 ] always holds true for @xmath222 . if the triple @xmath223 satisfies some conditions , the exponent @xmath224 can be larger .for example , when @xmath12 is the pinball loss ( [ pinloss ] ) and @xmath6 has a @xmath159-quantile of @xmath177-average type @xmath225 for some @xmath196 $ ] and @xmath226 as defined in @xcite , then @xmath227 .[ mainratesthm ] suppose that @xmath228 is bounded by a constant @xmath229 almost surely . under assumptions [ assumption1 ] to [ assumption3 ] ,if we take @xmath198 and @xmath230 for some @xmath231 , then for any @xmath232 , with confidence at least @xmath200 we have @xmath233 where @xmath234 is given by @xmath235 and @xmath202 is constant independent of @xmath203 or @xmath204 ( to be given explicitly in the proof ) .we now add some theoretical and numerical comparisons on the goodness of our learning rates with those from the literature . as already mentioned in the introduction, some reasons for the popularity of additive models are flexibility , increased interpretability , and ( often ) a reduced proneness of the curse of high dimensions .hence it is important to check , whether the learning rate given in theorem [ mainratesthm ] under the assumption of an additive model favourably compares to ( essentially ) optimal learning rates without this assumption . in other words ,we need to demonstrate that the main goal of this paper is achieved by theorem [ quantilethm ] and theorem [ mainratesthm ] , i.e. that an svm based on an additive kernel can provide a substantially better learning rate in high dimensions than an svm with a general kernel , say a classical gaussian rbf kernel , provided the assumption of an additive model is satisfied .our learning rate in theorem [ quantilethm ] is new and optimal in the literature of svm for quantile regression .most learning rates in the literature of svm for quantile regression are given for projected output functions @xmath236 , while it is well known that projections improve learning rates @xcite . here the projection operator @xmath237 is defined for any measurable function @xmath10 by @xmath238 sometimes this is called clipping .such results are given in @xcite .for example , under the assumptions that @xmath6 has a @xmath159-quantile of @xmath177-average type @xmath170 , the approximation error condition ( [ approxerrorb ] ) is satisfied for some @xmath239 , and that for some constants @xmath240 , the sequence of eigenvalues @xmath241 of the integral operator @xmath117 satisfies @xmath242 for every @xmath243 , it was shown in @xcite that with confidence at least @xmath200 , @xmath244 where @xmath245 here the parameter @xmath246 measures the capacity of the rkhs @xmath247 and it plays a similar role as half of the parameter @xmath147 in assumption 2 . for a @xmath193 kernel and @xmath112, one can choose @xmath246 and @xmath147 to be arbitrarily small and the above power index @xmath248 can be taken as @xmath249 .the learning rate in theorem [ quantilethm ] may be improved by relaxing assumption 1 to a sobolev smoothness condition for @xmath121 and a regularity condition for the marginal distribution @xmath250 .for example , one may use a gaussian kernel @xmath251 depending on the sample size @xmath203 and @xcite achieve the approximation error condition ( [ approxerrorb ] ) for some @xmath252 .this is done for quantile regression in @xcite .since we are mainly interested in additive models , we shall not discuss such an extension .[ gaussmore ] let @xmath48 , @xmath49 $ ] and @xmath50 ^ 2.$ ] let @xmath51 and the additive kernel @xmath72 be given by ( [ gaussaddform ] ) with @xmath253 in example [ gaussadd ] as @xmath52.\\ ] ] if the function @xmath121 is given by ( [ gaussfcn ] ) , @xmath191 almost surely for some constant @xmath192 , and @xmath6 has a @xmath159-quantile of @xmath177-average type @xmath170 for some @xmath196 $ ] , then by taking @xmath197 , for any @xmath145 and @xmath199 , ( [ quantilerates ] ) holds with confidence at least @xmath200 .    it is unknown whether the above learning rate can be derived by existing approaches in the literature ( e.g. @xcite ) even after projection .note that the kernel in the above example is independent of the sample size .it would be interesting to see whether there exists some @xmath99 such that the function @xmath57 defined by ( [ gaussfcn ] ) lies in the range of the operator @xmath254 .the existence of such a positive index would lead to the approximation error condition ( [ approxerrorb ] ) , see @xcite .    let us now add some numerical comparisons on the goodness of our learning rates given by theorem [ mainratesthm ] with those given by @xcite .their corollary 4.12 gives ( essentially ) minmax optimal learning rates for ( clipped ) svms in the context of nonparametric quantile regression using one gaussian rbf kernel on the whole input space under appropriate smoothness assumptions of the target function .let us consider the case that the distribution @xmath6 has a @xmath159-quantile of @xmath177-average type @xmath170 , where @xmath255 , and assume that both corollary 4.12 in @xcite and our theorem [ mainratesthm ] are applicable .i.e. , we assume in particular that @xmath6 is a probability measure on @xmath256 $ ] and that the marginal distribution @xmath257 has a lebesgue density @xmath258 for some @xmath259 . furthermore , suppose that the optimal decision function @xmath260 has ( to make theorem [ mainratesthm ] applicable with @xmath261 $ ] ) the additive structure @xmath207 with each @xmath104 as stated in assumption [ assumption1 ] , where @xmath262 and @xmath263 , with minimal risk @xmath86 and additionally fulfills ( to make corollary 4.12 in @xcite applicable ) @xmath264 where @xmath265 $ ] and @xmath266 denotes a besov space with smoothness parameter @xmath267 .the intuitive meaning of @xmath248 is , that increasing values of @xmath248 correspond to increased smoothness .we refer to ( * ? ? ? * and p. 44 ) for details on besov spaces .it is well - known that the besov space @xmath268 contains the sobolev space @xmath269 for @xmath270 , @xmath271 , and @xmath272 , and that @xmath273 .we mention that if all @xmath41 are suitably chosen wendland kernels , their reproducing kernel hilbert spaces @xmath43 are sobolev spaces , see ( * ? ? ?* thm . 10.35 , p. 160 ) .furthermore , we use the same sequence of regularizing parameters as in ( * ? ? ?4.9 , cor . 4.12 ) , i.e. , @xmath274 where @xmath275 , @xmath276 , @xmath277 $ ] , and @xmath278 is some user - defined positive constant independent of @xmath279 . forreasons of simplicity , let us fix @xmath280 .then ( * ? ? ?4.12 ) gives learning rates for the risk of svms for @xmath159-quantile regression , if a single gaussian rbf - kernel on @xmath281 is used for @xmath159-quantile functions of @xmath177-average type @xmath170 with @xmath255 , which are of order @xmath282 hence the learning rate in theorem [ quantilethm ] is better than the one in ( * ? ? ?4.12 ) in this situation , if @xmath283 provided the assumption of the additive model is valid .table [ table1 ] lists the values of @xmath284 from ( [ explicitratescz2 ] ) for some finite values of the dimension @xmath216 , where @xmath285 .all of these values of @xmath284 are positive with the exceptions if @xmath286 or @xmath287 .this is in contrast to the corresponding exponent in the learning rate by ( * ? ?* cor . 4.12 ) , because @xmath288    table [ table2 ] and figures [ figure1 ] to [ figure2 ] give additional information on the limit @xmath289 .of course , higher values of the exponent indicates faster rates of convergence .it is obvious , that an svm based on an additive kernel has a significantly faster rate of convergence in higher dimensions @xmath216 compared to svm based on a single gaussian rbf kernel defined on the whole input space , of course under the assumption that the additive model is valid .the figures seem to indicate that our learning rate from theorem [ mainratesthm ] is probably not optimal for small dimensions . however , the main focus of the present paper is on high dimensions ..[table1 ] the table lists the limits of the exponents @xmath290 from ( * ? ? ?* cor . 4.12 ) and @xmath291 from theorem [ mainratesthm ] , respectively , if the regularizing parameter @xmath292 is chosen in an optimal manner for the nonparametric setup , i.e. @xmath293 , with @xmath294 for @xmath295 and @xmath296 .recall that @xmath297 $ ] .[ cols= " > , > , > , > " , ]'

    """

    logger.info('define model and model parameters')
    label_smoothing = SMOOTHING
    tokenizer = LongformerTokenizer.from_pretrained(MODEL_PATH,use_fast=True)

    config = LongformerEncoderDecoderConfig.from_pretrained(MODEL_PATH)
    config.attention_dropout = DROPOUT
    # config.gradient_checkpointing = grad_ckpt
    config.attention_mode = ATTENTION_MODE
    config.attention_window = [ATTENTION_WINDOW] * config.encoder_layers

    #longformer-encdec-base-16384/
    model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(MODEL_PATH, config=config)

    logger.info('tokenize/encode input summary')
    input_ids = tokenizer.encode(input_data, return_tensors="pt").to(device)
    logger.info('get attention mask')
    input_ids, attention_mask = _prepare_input(input_ids)
    logger.info('summarizing article...')
    summary_ids =  model.generate(input_ids=input_ids, attention_mask=attention_mask,
                                                use_cache=True,
                                                num_beams=4,
                                                min_length=200,
                                                max_length=300,
                                 )
    print(tokenizer.batch_decode(summary_ids.tolist(), skip_special_tokens=True))

It ends up generating the following summary which doesn't seem to make sense

[' [ [ ] we have @xmath6, @xmath6, @xmath6, @xmath6, @xmath6, and @xmath6, @xmath6, @xmath6, @xmath6, and @xmath6, @xmath6, and @xmath6, respectively. we have @xmath6, @xmath6, @xmath6, and @xmath6, @xmath6, and @xmath6. we have @xmath6, @xmath6, @xmath6, @xmath6, and @xmath6, @xmath6, @xmath6, and @xmath6, @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath7, and @xmath6, and @xmath6, and @xmath6, and @xmath6, and @xmath7, and @xmath6, and @xmath7, and @xmath6, and @xmath7, and @xmath6, and @xmath6, and @xmath7, and @xmath6, and']

I’m wondering what I might be doing wrong/what I might be missing? I’d greatly appreciate any help/feedback with this

mmoya01 commented 3 years ago

I even used no_repeat_ngram_size=2 but still got an odd output summary. Am I using the token batch_decoder incorrectly?