dmis-lab / bioasq-biobert

Pre-trained Language Model for Biomedical Question Answering
https://arxiv.org/abs/1909.08229

Covid-19 papers #8

Closed tonyreina closed 4 months ago

tonyreina commented 4 years ago

I was thinking of using BioBERT-BioASQ as a web service for people to scan COVID-19 articles (the "context") and ask questions about them. One thing I wasn't sure about is the sequence length. I think the inputs have to be 384 tokens or fewer. If I fine-tune the model, can I expand the sequence length to something more like 2048 tokens? Would that affect accuracy? Or are there better ways to handle full-length articles as the context? Thanks. -Tony

jhyuklee commented 4 years ago

Hi @tonyreina, we are actually preparing a web service for COVID-19 papers, and it will be available soon. To answer your question: sequences longer than 384 tokens can be split with a sliding 384-token window, which is how BERT handles long inputs. This does affect accuracy (usually for the worse), and you would need to properly normalize the answer scores across the multiple windows. Clark's paper on this matter might help. Thanks.
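The sliding-window slicing described above can be sketched roughly as follows. This is a minimal illustration, not BioBERT's actual preprocessing code: the window size (384) and stride (128) mirror common BERT QA defaults, the string tokens stand in for real WordPiece tokens, and `normalize_across_windows` is a hypothetical helper showing one way to make per-window answer scores comparable (a softmax over the pooled scores).

```python
import math

def slide_windows(tokens, max_len=384, stride=128):
    """Split a long token sequence into overlapping windows of at most
    max_len tokens, advancing by stride tokens each time."""
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end of the sequence
        start += stride
    return windows

def normalize_across_windows(window_scores):
    """Softmax over candidate-answer scores pooled from all windows,
    so that scores produced in different windows are comparable."""
    flat = [s for scores in window_scores for s in scores]
    m = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in flat]
    total = sum(exps)
    return [e / total for e in exps]

# A 1000-token "article" gets split into overlapping 384-token windows.
tokens = [f"tok{i}" for i in range(1000)]
windows = slide_windows(tokens)
print(len(windows), len(windows[0]))  # -> 6 384

# Scores from two windows are normalized jointly, not per window.
probs = normalize_across_windows([[2.0, 0.5], [1.0, 3.0]])
print(round(sum(probs), 6))  # -> 1.0
```

The overlap (stride < max_len) matters: an answer span that straddles a window boundary still appears whole in at least one window.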

tonyreina commented 4 years ago

Very cool. I work at Intel and we're interested in helping out wherever we can. Is there anything we could do to help? I'm wondering if you need compute resources or programming help in deploying. Please let me know.

jhyuklee commented 4 years ago

Thank you, Tony. As soon as we are ready for deployment, we will ask for help. I'll let you know when we are ready. Thanks.