apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Usability problems #13784

Open uuf6429 opened 1 year ago

uuf6429 commented 1 year ago

I'm completely new to Druid and I'm evaluating it as a data warehouse + analytics solution (using the docker-compose sample from this repo).

Unfortunately, I've had some problems working with it. I think some of them might be intentional, but maybe not so clear to a first-time user? Here goes:

  1. If data is not loaded successfully after ingestion (e.g. records do not match the preconfigured task regex), the ingestion itself is marked as successful, but nothing is imported. To me this is a problem for these reasons (see the first sketch after this list):
     a. this effectively means data might be, or is being, lost
     b. this happens without warning
     c. I couldn't find any logging related to this (the logs panel was not useful for tracking this problem down)
     d. I'd expect to see a warning somewhere when tasks do not import anything (and maybe a setting to turn it off where that makes sense)
  2. When I had two ingestion tasks pointing to the same datasource, the last one ended up overwriting data from the previous one. I still can't figure out the exact reason, but (see the second sketch after this list):
     a. as a data platform, data should be sacred 😄 - it should triple-check with the user before data is overwritten or dropped automatically
     b. while investigating, I came across the "segment granularity" setting, which I assume relates to my problem - again, if it can potentially remove data, it should be visually prominent
     c. in my opinion, the UX should be geared towards a more cautious approach - i.e. it's easier to delete data than to get it back, so it may be better to have defaults that ensure data is retained
  3. Not sure if this is a bug: I had some segments showing "0" in the "Num rows" column, but the "Records" tab in the segment details modal actually showed a record.
  4. I don't know what I did, but at one point, while ingestion still worked, all datasources were marked as "unavailable" (yellow circle) and I had to restart all the containers to get things unstuck.
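
For point 1, a sketch of the kind of setting I would have wanted on by default: if I understand the docs correctly, the native batch `tuningConfig` has `maxParseExceptions` / `logParseExceptions` options that make unparseable rows fail the task loudly instead of being dropped silently. The values below are my assumptions, not verified defaults:

```python
# Sketch of a native batch (index_parallel) tuningConfig that surfaces
# silently-dropped rows. Based on my reading of the Druid docs; treat
# the exact values as assumptions, not verified defaults.
tuning_config = {
    "type": "index_parallel",
    # Fail the task as soon as a single input row cannot be parsed,
    # instead of skipping it silently.
    "maxParseExceptions": 0,
    # Write each parse exception to the task logs so it is visible.
    "logParseExceptions": True,
}
```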
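
For point 2, I now suspect the overwrite happens because batch ingestion replaces whole time chunks by default. As far as I can tell, `appendToExisting` in the `ioConfig` makes a task add segments instead of replacing them; a sketch, with a placeholder URL of my own:

```python
# Sketch of an ioConfig that appends to a datasource instead of
# overwriting existing segments in the same time chunk. Based on my
# understanding of the Druid docs; treat the details as assumptions.
io_config = {
    "type": "index_parallel",
    "inputSource": {
        "type": "http",
        # Placeholder URL; each of my real URLs returns a flat JSON object.
        "uris": ["https://example.com/stats.json"],
    },
    "inputFormat": {"type": "json"},
    # Append new segments rather than replacing the time chunk.
    # (My understanding is this only works with dynamic partitioning.)
    "appendToExisting": True,
}
```

If that is right, it would also explain why two tasks writing to the same datasource with the same segment granularity clobbered each other.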

In the end, my current objective is to periodically collect some statistics from a bunch of URLs (each URL returns a flat JSON object). The Druid tasks for this are triggered by NiFi (via Druid's task POST API). Speaking in RDBMS terms, I'd like a row for each URL in the same stats table (a datasource, in Druid terms, right?).
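
To illustrate, this is roughly the shape of the task that NiFi submits, as a self-contained Python sketch; the datasource name, field names, and router address are placeholders of mine, not the real flow:

```python
import requests

# Hypothetical ingestion spec: one flat JSON object per URL, appended
# into a single "url_stats" datasource. All names here are placeholders.
task = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "url_stats",
            # Assumes each JSON object carries an ISO-8601 "timestamp" field.
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["url", "status"]},
            "granularitySpec": {"segmentGranularity": "day"},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "http", "uris": ["https://example.com/stats.json"]},
            "inputFormat": {"type": "json"},
            # Append, so each periodic run adds rows instead of replacing them.
            "appendToExisting": True,
        },
    },
}

# Submit to the task endpoint (port 8888 is the router in the
# docker-compose quickstart, which proxies to the Overlord).
resp = requests.post("http://localhost:8888/druid/indexer/v1/task", json=task)
resp.raise_for_status()
print(resp.json())  # e.g. {"task": "index_parallel_url_stats_..."}
```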

github-actions[bot] commented 8 months ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.