elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

** (RuntimeError) from_parquet failed: {:polars, "External format error: File out of specification: A parquet file must containt a header and footer with at least 12 bytes"} #516

Closed davidtew closed 1 year ago

davidtew commented 1 year ago

Running in Livebook attached to a local Phoenix app; Firefox, macOS 12.6

I am simply reporting the errors below:

[error] GenServer #PID<0.28348.1> terminating
** (RuntimeError) from_parquet failed: {:polars, "External format error: File out of specification: A parquet file must containt a header and footer with at least 12 bytes"}
(explorer 0.5.0) lib/explorer/data_frame.ex:556: Explorer.DataFrame.from_parquet!/2

cell:rp2643xe3yq6kslovkswg24ufkh2twz5:306: G.Batch.QphoneNETA.logger/0

#cell:sesm2ltwzm2p6v6w4rukgu5mc63mbzkc:15: G.Batch.QphoneNETA.Periodically.handle_info/2
(stdlib 4.1.1) gen_server.erl:1123: :gen_server.try_dispatch/4
(stdlib 4.1.1) gen_server.erl:1200: :gen_server.handle_msg/6
(stdlib 4.1.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

Last message: :work
State: %{}

Could it have been caused by simultaneous reads or read/write clashes? I was running a process periodically via a GenServer, with a 1-minute period. Each run calls an external API and then processes the result for a few seconds. (Also, I don't know whether I unintentionally had more than one GenServer running the code at the same time. Is this even possible? I am new to Elixir.)

I now have changed the period to 3 mins (a few hours ago) and it hasn't happened since.

Apart from this, Explorer has been reading files (60000 x 20 and 160000 x 14) and writing to them hundreds of times a day with no problem. I am using Explorer as a 'database' at the moment, reading/duplicating many Google Sheets.

BTW, at some point that same evening, I also got a Bus Error 10. I cannot say whether the two error messages occurred at the same time.

josevalim commented 1 year ago

Could it have been caused by simultaneous reads or read/write clashes?

Perhaps! Are you removing or writing to this file concurrently? If you are doing periodic reads, I assume it is because there is something periodically changing it too?

davidtew commented 1 year ago

[Could it have been caused by simultaneous reads or read/write clashes?] Perhaps! Are you removing or writing to this file concurrently?

TL;DR: It seems rare but possible. Two (or more) pieces of code that address the same parquet files run uncoordinated, i.e. each periodically does its work without reference to the other; their periods might coincide so that the same file is addressed at the same time.

Detail:
I have 3 parquet files, each of which is topped up with data from APIs polled every few minutes (two from Google Sheets, one from a form submission SaaS). I'll call this the 'polling code'.

The data collected in the parquet files is then to be processed e.g. matching incoming information to customer data - this one I'll call 'processing code'.

Within the polling code, one parquet file is read (to get the previous last date), and that date is used in the call to the API. Any returned data is written back to the same parquet file. A conflict could also happen here if the GenServer polls twice simultaneously (IDK).

Within the processing code, the parquet file from the polling code is read, then new data is processed. Processed data is written back to other files (Erlang Term Storage for now).

Occasionally the processing code and the polling code would inadvertently run at the same time, I suppose. Even then, it is unlikely that the same parquet file would be addressed at the same instant (since it is a millisecond operation), but it could happen once in a while. I don't know of any locking mechanism in Elixir to prevent this, so I haven't deployed anything of this nature.
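The clash described above is a reader opening the file while a writer is partway through it. One common way to narrow that window, which is not something proposed in this thread, is to write to a temporary file and rename it over the real path once the write has finished. On POSIX filesystems a rename within the same directory is atomic, so a reader sees either the old file or the new one. A minimal sketch, where `AtomicFile` and the temporary-name scheme are hypothetical:

```elixir
defmodule AtomicFile do
  # Hypothetical helper, not part of Explorer or the Elixir standard library.

  @doc """
  Calls `write_fun` with a temporary path next to `path`, then renames the
  temporary file over `path`. `write_fun` is expected to raise on failure.
  """
  def write(path, write_fun) when is_function(write_fun, 1) do
    # Keep the temporary file in the same directory so the rename stays
    # on one filesystem.
    tmp = "#{path}.tmp-#{System.unique_integer([:positive])}"

    try do
      write_fun.(tmp)
      # Swap the finished file into place in one step.
      File.rename!(tmp, path)
    rescue
      error ->
        # Best-effort cleanup of the partial temporary file.
        File.rm(tmp)
        reraise error, __STACKTRACE__
    end
  end
end

# Usage with a plain file; for a dataframe the function would instead call
# the parquet writer (e.g. Explorer.DataFrame.to_parquet/2) on `tmp`.
path = Path.join(System.tmp_dir!(), "atomic_file_example.txt")
:ok = AtomicFile.write(path, fn tmp -> File.write!(tmp, "hello") end)
```

This only protects readers from half-written files. Two writers that each read, modify, and write back can still overwrite each other's changes, so it complements rather than replaces serialising the work in one process.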

josevalim commented 1 year ago

One way to lock is to make sure that the same process performs both actions, even if it uses distinct intervals for each. Given it is always the same process, they cannot overlap. Another option is to have a server responsible for rotating files, which you ask for the most recent file to process; this way you always keep the last 2 or 3 around.
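The first option above could look roughly like the sketch below. The module name `ParquetWorker`, the `:path` option, the 5-minute processing interval, and the `poll/1` and `process/1` placeholders are all assumptions for illustration; the 3-minute polling interval is the one mentioned earlier in the thread. A GenServer handles one message at a time, so a `:poll` and a `:process` can never run concurrently, and registering the process under a fixed name also guards against accidentally starting a second copy.

```elixir
defmodule ParquetWorker do
  # Hypothetical module: one process owns both the polling and the
  # processing step, so they cannot overlap on the parquet file.
  use GenServer

  @poll_every :timer.minutes(3)
  @process_every :timer.minutes(5)

  def start_link(opts) do
    # With a fixed name, a second start_link/1 returns
    # {:error, {:already_started, pid}} instead of starting a duplicate.
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    Process.send_after(self(), :poll, @poll_every)
    Process.send_after(self(), :process, @process_every)
    {:ok, %{path: Keyword.fetch!(opts, :path)}}
  end

  @impl true
  def handle_info(:poll, state) do
    poll(state.path)
    # Reschedule only after the work is done, so runs never pile up.
    Process.send_after(self(), :poll, @poll_every)
    {:noreply, state}
  end

  def handle_info(:process, state) do
    process(state.path)
    Process.send_after(self(), :process, @process_every)
    {:noreply, state}
  end

  # Placeholders: the real versions would read the parquet file, call the
  # API, and write the result back (poll/1), or read and process new rows
  # (process/1).
  defp poll(_path), do: :ok
  defp process(_path), do: :ok
end
```

Because each timer is rescheduled at the end of its handler, a slow API call delays the next run rather than stacking a second one on top of it.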

I will go ahead and close this. Let us know if it persists even when there is no chance of concurrent writes/removals. Thanks for the updates!