When converting very large datasets, I have often observed that the resulting Parquet files may be corrupted, which leads to errors when we later try to read them. I suggest integrating an integrity check of the produced file into each conversion command. Example error:
Traceback (most recent call last):
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/bin/quantmsio_cli", line 33, in <module>
    sys.exit(load_entry_point('quantmsio==0.0.3', 'console_scripts', 'quantmsio_cli')())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/quantmsio-0.0.3-py3.11.egg/quantms_io/quantmsio_cli.py", line 70, in quantms_io_main
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
    ^^^^^^^^^^^^^^^^
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/quantmsio-0.0.3-py3.11.egg/quantms_io/commands/get_unanimous_command.py", line 30, in get_unanimous_for_parquet
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/quantmsio-0.0.3-py3.11.egg/quantms_io/core/tools.py", line 232, in map_protein_for_parquet
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/quantmsio-0.0.3-py3.11.egg/quantms_io/core/tools.py", line 278, in change_and_save_parquet
  File "/hps/software/users/juan/pride/anaconda3/envs/quantmsio/lib/python3.11/site-packages/quantmsio-0.0.3-py3.11.egg/quantms_io/core/tools.py", line 311, in read_large_parquet
  File "pyarrow/_parquet.pyx", line 1323, in iter_batches
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Tried reading 1025877 bytes starting at position 31565171 from file but only got 0