allenai/specter

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Apache License 2.0

How to change the sequence length? #31

Closed: nshinton closed this issue 3 years ago

nshinton commented 3 years ago

Hi, I'd like to change the max sequence length in order to embed larger documents.

Is there an extra argument I can give to embed.py to do this?

I notice that embed_papers_hf.py has a max_length parameter, but to use that script I need some way to specify that I don't have a GPU.

Would appreciate any help with either of these scripts. :)
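For reference, the embedding step can be reproduced outside the repo scripts with the Hugging Face transformers API directly. Below is a minimal CPU-only sketch with an explicit `max_length`; the hub model id `allenai/specter` and the title + `[SEP]` + abstract input format follow SPECTER's README rather than anything shown in this thread, so treat the details as assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch only: embed papers on CPU with an explicit max_length.
# Assumes the Hugging Face hub model id "allenai/specter".
device = torch.device("cpu")  # no GPU required

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter").to(device).eval()

papers = [
    {"title": "BERT", "abstract": "We introduce a new language representation model ..."},
]
# SPECTER's input format is title + [SEP] + abstract.
texts = [p["title"] + tokenizer.sep_token + p.get("abstract", "") for p in papers]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,  # cannot exceed 512 for the SciBERT-based model
    return_tensors="pt",
).to(device)

with torch.no_grad():
    output = model(**inputs)

# SPECTER uses the [CLS] token's final hidden state as the document embedding.
embeddings = output.last_hidden_state[:, 0, :]
print(embeddings.shape)  # (num_papers, 768)
```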

armancohan commented 3 years ago

SPECTER's underlying pre-trained Transformer model is SciBERT, which inherits BERT's 512-token sequence length limit. At this time we have only pre-trained SPECTER starting from SciBERT. If you'd like to process longer inputs, you would need to swap SciBERT for something like Longformer and retrain SPECTER.
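To see where the limit comes from, one can inspect the model config; a quick check (again assuming the hub id `allenai/specter`):

```python
from transformers import AutoConfig

# SPECTER inherits SciBERT's BERT-style learned position embeddings,
# so inputs are capped at 512 tokens.
config = AutoConfig.from_pretrained("allenai/specter")
print(config.max_position_embeddings)  # 512
```

Anything longer would index past the learned position-embedding table, so simply passing a larger `max_length` cannot work; it only helps after swapping in a long-input encoder such as Longformer and retraining.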