dice-group / dice-embeddings

Hardware-agnostic Framework for Large-scale Knowledge Graph Embeddings
MIT License

pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file. #246

Open sshivam95 opened 1 week ago

sshivam95 commented 1 week ago

When training an embedding model on a KG, I get the following traceback:

Reading with pandas.read_csv with sep ** s+ ** ...
Traceback (most recent call last):
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/bin/dicee", line 33, in <module>
    sys.exit(load_entry_point('dicee', 'console_scripts', 'dicee')())
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/scripts/run.py", line 137, in main
    Execute(get_default_arguments()).start()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/executer.py", line 218, in start
    self.load_indexed_data() if self.is_continual_training else self.read_preprocess_index_serialize_data()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/executer.py", line 88, in read_preprocess_index_serialize_data
    self.knowledge_graph = self.read_or_load_kg()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/executer.py", line 53, in read_or_load_kg
    kg = KG(dataset_dir=self.args.dataset_dir,
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/knowledge_graph.py", line 74, in __init__
    ReadFromDisk(kg=self).start()
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/read_from_disk.py", line 28, in start
    self.kg.raw_train_set = read_from_disk(self.kg.path_single_kg,
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/util.py", line 125, in read_from_disk
    return read_with_pandas(data_path, read_only_few, sample_triples_ratio)
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/util.py", line 31, in timeit_wrapper
    result = func(*args, **kwargs)
  File "/scratch/hpc-prf-dsg/WHALE-output/dice-embeddings/dicee/read_preprocess_save_load_kg/util.py", line 83, in read_with_pandas
    df = pd.read_csv(data_path,
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/scratch/hpc-prf-dsg/sshivam/.conda/envs/dice/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Initially, I thought the problem was the input file. However, after adding engine='python' to the pandas.read_csv call in dicee/read_preprocess_save_load_kg/util.py, the error no longer occurs.
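
For anyone hitting the same error, here is a minimal sketch of the workaround. It is not the actual code in util.py: the helper name read_triples, the column names, and the assumption of three whitespace-separated columns per line are all hypothetical and only illustrate passing engine='python' to pandas.read_csv.

```python
import pandas as pd


def read_triples(path, engine="python"):
    """Hypothetical helper: read whitespace-separated triples into a DataFrame.

    Assumes each line holds exactly three whitespace-separated fields.
    """
    return pd.read_csv(
        path,
        sep=r"\s+",      # one or more whitespace characters between fields
        header=None,     # the file has no header row
        names=["subject", "relation", "object"],  # assumed column names
        dtype=str,       # keep every field as a string
        engine=engine,   # "python" bypasses the C tokenizer that raised ParserError
    )
```

The Python engine is generally slower than the default C engine, so on large KGs it may be worth checking the input file for malformed lines as well rather than relying on the engine switch alone.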