ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Validate filesystem URLs at planning time #251

Open mwylde opened 10 months ago

mwylde commented 10 months ago

For filesystem sources created via SQL, we do not validate them as part of the SQL planning process. This causes panics at runtime when the source is instantiated on the worker:

2023-08-15T03:28:44.132894Z ERROR arroyo_server_common: panicked at 'called `Result::unwrap()` on an `Err` value: RelativeUrlWithoutBase', /opt/arroyo/src/arroyo-worker/src/connectors/filesystem/mod.rs:119:65 panic.file="/opt/arroyo/src/arroyo-worker/src/connectors/filesystem/mod.rs" panic.line=119 panic.column=65
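
To make the failure mode above concrete, here is a minimal, std-only sketch of a planning-time check that returns an error instead of calling `unwrap()` on the worker. Everything in it is hypothetical and not taken from the Arroyo codebase: the `PathError` type, the `validate_source_path` function, and the `SUPPORTED_SCHEMES` list are all assumptions. A real fix would more likely propagate the error from the URL parser the connector already uses.

```rust
use std::fmt;

/// Hypothetical planning-time error type (not an Arroyo type).
#[derive(Debug, PartialEq)]
enum PathError {
    Empty,
    MissingScheme(String),
    UnsupportedScheme(String),
}

impl fmt::Display for PathError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PathError::Empty => write!(f, "filesystem path is empty"),
            PathError::MissingScheme(p) => {
                write!(f, "path '{}' is not an absolute URL (no scheme)", p)
            }
            PathError::UnsupportedScheme(s) => write!(f, "unsupported scheme '{}'", s),
        }
    }
}

/// Assumed set of schemes, chosen only for illustration.
const SUPPORTED_SCHEMES: &[&str] = &["file", "s3"];

/// Checks the path syntactically so the planner can reject it with a
/// user-facing error rather than letting the worker panic later.
fn validate_source_path(path: &str) -> Result<(), PathError> {
    let path = path.trim();
    if path.is_empty() {
        return Err(PathError::Empty);
    }
    // A relative path such as "data/events" has no "scheme://" prefix.
    // This is the kind of input a URL parser rejects as relative without a base.
    let Some((scheme, _rest)) = path.split_once("://") else {
        return Err(PathError::MissingScheme(path.to_string()));
    };
    if !SUPPORTED_SCHEMES.iter().any(|s| *s == scheme) {
        return Err(PathError::UnsupportedScheme(scheme.to_string()));
    }
    Ok(())
}
```

A planner could call `validate_source_path` on the path of a `CREATE TABLE` statement and surface the `Display` message as a SQL error. A check like this is purely syntactic: it does not tell you whether the path exists or whether credentials are valid.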

hilmialf commented 8 months ago

Hi @mwylde, I would also like to give this one a shot. Could you walk me through how to get started? In particular, where is the SQL planning done? IIUC, the planning is delegated to DataFusion?

rohitrastogi commented 2 months ago

@mwylde I'm not able to reproduce this specific panic (RelativeUrlWithoutBase) exactly, but I did notice a few related issues when trying to reproduce it using ghcr.io/arroyosystems/arroyo-single:0.10-dev:

  1. Pipelines/previews succeed even if the path for a filesystem source created via SQL does not exist. I'd expect some sort of failure if the path does not exist.
  2. The path "file:///" for a filesystem source created with SQL panics during query execution with: ERROR arroyo_server_common: panicked at crates/arroyo-connectors/src/filesystem/source.rs:69:17: could not get next path: Generic LocalFileSystem error: Unable to walk dir: File system loop found: /sys/class/vtconsole/vtcon0/subsystem points to an ancestor /sys/class/vtconsole panic.file="crates/arroyo-connectors/src/filesystem/source.rs" panic.line=69 panic.column=17
  3. An S3 path without valid S3 creds for a filesystem source created with SQL panics during query execution with: panicked at crates/arroyo-connectors/src/filesystem/source.rs:69:17: could not get next path: Generic s3 error: Couldn't find AWS credentials in environment, credentials file, or IAM role. panic.file="crates/arroyo-connectors/src/filesystem/source.rs" panic.line=69 panic.colum
  4. Creating filesystem sources in the UI always succeeds, even if the path entered is malformed. See: https://github.com/ArroyoSystems/arroyo/blob/5fc6fe06cbbdc866f232bd813eb5e8aff16bcb3a/crates/arroyo-connectors/src/filesystem/mod.rs#L70-L77
  5. Kafka sources created via SQL panic if the topic does not exist: panicked at crates/arroyo-worker/src/lib.rs:622:14: called `Result::unwrap()` on an `Err` value: SendError { .. } panic.file="crates/arroyo-worker/src/lib.rs" panic.line=622 panic.column=14

What do you think about running, while planning sources during the scheduling phase, the same connection test() logic that runs when connectors are created in the UI? If each connector properly implements the test() logic, it should solve all of the problems above.
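
As a rough illustration of that proposal, the sketch below runs a per-connector `test()` over every planned source and collects the failures before anything is scheduled. The `Connector` trait, the two source structs, and `validate_sources` are hypothetical stand-ins written for this example. Arroyo's actual connector trait and `test()` signature may look quite different, and the Kafka check here is a stub rather than a real broker lookup.

```rust
/// Hypothetical connector interface (an assumption, not Arroyo's real trait).
trait Connector {
    fn name(&self) -> &str;
    /// Returns Ok(()) if the configured resource looks usable.
    fn test(&self) -> Result<(), String>;
}

/// Illustrative filesystem source holding only its configured path.
struct FileSystemSource {
    path: String,
}

/// Illustrative Kafka source. `known_topics` stands in for broker metadata.
struct KafkaSource {
    topic: String,
    known_topics: Vec<String>,
}

impl Connector for FileSystemSource {
    fn name(&self) -> &str {
        "filesystem"
    }

    fn test(&self) -> Result<(), String> {
        // Syntactic check only: the path must carry a scheme.
        if self.path.contains("://") {
            Ok(())
        } else {
            Err(format!("path '{}' has no scheme", self.path))
        }
    }
}

impl Connector for KafkaSource {
    fn name(&self) -> &str {
        "kafka"
    }

    fn test(&self) -> Result<(), String> {
        // Stub for a metadata request against the broker.
        if self.known_topics.iter().any(|t| t == &self.topic) {
            Ok(())
        } else {
            Err(format!("topic '{}' does not exist", self.topic))
        }
    }
}

/// Runs every source's test() and returns all failures at once, so the
/// planner can report them instead of the worker panicking later.
fn validate_sources(sources: &[Box<dyn Connector>]) -> Result<(), Vec<String>> {
    let errors: Vec<String> = sources
        .iter()
        .filter_map(|s| s.test().err().map(|e| format!("{}: {}", s.name(), e)))
        .collect();
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Collecting all errors, rather than stopping at the first, lets a single planning pass report every misconfigured source in a query.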