YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Issue while loading openaqa_5.6M.json #21

Open sonalkum opened 7 months ago

sonalkum commented 7 months ago

Hello,

Thank you so much for sharing the code. Great work on the repo!!

I am trying to run the LTU code on OpenAQA. I've completed the first 3 stages of training, but I am stuck on the 4th stage. So, I was wondering if you faced any issue similar to the following:

  pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2234708122

If yes, how did you resolve it? I think this is an issue with pyarrow, which HuggingFace's datasets.load_dataset() function uses under the hood: a single pyarrow array is capped at 2147483646 bytes (~2 GiB), and the data here is 2234708122 bytes.

Thanks in advance.

YuanGongND commented 7 months ago

hi there,

Thanks for the question.

No, we didn't have this error. FYI, our machine has 512G CPU RAM (not VRAM). This might be a RAM issue.

A quick Google search turns up this: https://github.com/huggingface/datasets/issues/4782
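
One common workaround for this pyarrow capacity limit is to shard the large JSON before calling load_dataset(), so that no single pyarrow array has to hold the whole 2.2 GB. Below is a minimal sketch, not something tested on OpenAQA; the file names and shard count are placeholders, and it assumes openaqa_5.6M.json is a single top-level JSON array:

```python
import json

from datasets import load_dataset

SRC = "openaqa_5.6M.json"   # placeholder path to the stage-4 data
N_SHARDS = 4                # keeps each shard well under pyarrow's ~2 GiB cap

# Assumes the file is one top-level JSON array of QA dicts.
with open(SRC, "r") as f:
    data = json.load(f)

shard_size = (len(data) + N_SHARDS - 1) // N_SHARDS
shard_files = []
for i in range(N_SHARDS):
    path = f"openaqa_shard_{i}.json"
    with open(path, "w") as f:
        json.dump(data[i * shard_size:(i + 1) * shard_size], f)
    shard_files.append(path)

# datasets reads each file separately and concatenates them into one split,
# so the per-array capacity limit is never hit.
dataset = load_dataset("json", data_files=shard_files)
```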

-Yuan

YuanGongND commented 7 months ago

Consider trimming the data a little bit. I actually believe that cutting the stage 4 data in half with random sampling and doubling the training epochs could still train a model with reasonable performance.
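
A minimal sketch of that random half-sampling (file names are placeholders, and this again assumes the stage-4 data is a top-level JSON array):

```python
import json
import random

random.seed(0)  # fixed seed so the sampled subset is reproducible

with open("openaqa_5.6M.json", "r") as f:   # placeholder input path
    data = json.load(f)

half = random.sample(data, len(data) // 2)  # random 50% of the examples

with open("openaqa_half.json", "w") as f:   # placeholder output path
    json.dump(half, f)
```

Then double the number of epochs in the stage-4 training script so the model still sees roughly the same total number of examples.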

sonalkum commented 7 months ago

Thank you for your quick response. We also have 512G of CPU RAM in our setting, so I was curious whether you had faced a similar issue.

YuanGongND commented 7 months ago

This is a bit weird. That reminds me that in LTU-AS we have 10.6M training samples, and still did not have any issue.

Have you changed any code?