allenai / mmda

multimodal document analysis
Apache License 2.0

join symbols in source_text using a space #253

Closed dmh43 closed 1 year ago

dmh43 commented 1 year ago

When source_text is a list of tokens, make sure to join them with a space. We weren't doing this, so the model saw citation text like Krawczyketal.2013, which is not the kind of data the model was trained on.

I also join target_text with a space, but I haven't seen a case where it is a list with more than one element.
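
A minimal sketch of the change described above, for illustration only. The helper name `join_tokens` and the example tokenization are assumptions, not the actual code in this PR; the sketch only shows the difference between concatenating tokens directly and joining them with a space.

```python
from typing import List, Union


def join_tokens(text: Union[str, List[str]]) -> str:
    """Return text as one string, joining a list of tokens with single spaces.

    Hypothetical helper: the real field handling in mmda may differ.
    """
    if isinstance(text, str):
        # Already a plain string, nothing to join.
        return text
    return " ".join(text)


# Assumed tokenization of a citation mention.
source_text = ["Krawczyk", "et", "al.", "2013"]

print("".join(source_text))      # Krawczyketal.2013 (the old, fused form)
print(join_tokens(source_text))  # Krawczyk et al. 2013
```

Applying the same helper to target_text is harmless when it is a string or a one-element list, since both come back unchanged.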