Parsing non-standard sized documents

NicholasMcElroy commented 2 years ago

Hello,

I've been using VILA on scientific publications and it works exceedingly well. I was wondering whether it could also be used for documents with non-standard page sizes (e.g. research posters). Currently, when I try to parse a document like that, I get the following error:

Traceback (most recent call last):
  File "/home/nick/.local/lib/python3.9/site-packages/transformers/models/layoutlm/modeling_layoutlm.py", line 105, in forward
    left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])
  File "/home/nick/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nick/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/nick/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nick/Desktop/tests/ext_tests/test2.py", line 48, in <module>
    predicted_tokens = pdf_predictor.predict(pdf_data)
  File "/home/nick/.local/lib/python3.9/site-packages/vila/predictors.py", line 72, in predict
    model_outputs = self.model(**self.model_input_collator(model_inputs))
  File "/home/nick/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nick/.local/lib/python3.9/site-packages/vila/models/hierarchical_model.py", line 263, in forward
    outputs = self.hierarchical_model(
  File "/home/nick/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nick/.local/lib/python3.9/site-packages/vila/models/hierarchical_model.py", line 223, in forward
    embedded_lines = self.textline_model.embeddings(
  File "/home/nick/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nick/.local/lib/python3.9/site-packages/transformers/models/layoutlm/modeling_layoutlm.py", line 110, in forward
    raise IndexError("The :obj:`bbox`coordinate values should be within 0-1000 range.") from e
IndexError: The :obj:`bbox`coordinate values should be within 0-1000 range.

I assume it has something to do with the dimensions of the document, but I'm not completely sure. Any input you can provide on getting this to work would be greatly appreciated, thank you!

lolipopshock commented 2 years ago

Thanks! In that case, I suggest manually normalizing all the token coordinates to the 0-1000 range, as the code doesn't do token position normalization right now.
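
For anyone landing here with the same error, here is a minimal sketch of that kind of manual normalization. It assumes each token box is given as (x1, y1, x2, y2) in the same units as the page width and height. The function name `normalize_bbox`, the clamping step, and the poster dimensions in the example are illustrative assumptions, not part of VILA's API.

```python
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in the same units as the page size.
Box = Tuple[float, float, float, float]


def normalize_bbox(box: Box, page_width: float, page_height: float,
                   scale: int = 1000) -> List[int]:
    """Rescale one box from page units into the integer range [0, scale]."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    x1, y1, x2, y2 = box
    scaled = [
        x1 / page_width * scale,
        y1 / page_height * scale,
        x2 / page_width * scale,
        y2 / page_height * scale,
    ]
    # Clamp so tokens that spill slightly past the page edge stay in range.
    return [min(scale, max(0, int(round(v)))) for v in scaled]


# Example: a 36 x 48 inch poster is 2592 x 3456 points (assumed page size).
print(normalize_bbox((1296, 1728, 2592, 3456), 2592, 3456))
```

Scaling x by the page width and y by the page height keeps every coordinate inside the 0-1000 range that the LayoutLM error message asks for, whatever the physical page size is. The boxes would need to be rescaled this way before the page data is passed to the predictor.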

lolipopshock commented 2 years ago

You might want to check #16

NicholasMcElroy commented 2 years ago

Very cool! I had been normalizing the token coordinates as you suggested, and it's nice that it's part of the library now. Thank you! Looking forward to seeing the retrained model.

allenai/vila, issue #14