apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

Can Apache Drill query a list of files with updated data? #2910

Closed kevinlo closed 2 months ago

kevinlo commented 2 months ago

I am new to Apache Drill. I posted this question on Stack Overflow and didn't get an answer, so I am trying here to see if someone can answer it. I am sorry if this is not the right place.

I have a large (more than 8.5GB) CSV file that is updated on the first day of each month. From the 2nd to the last day of each month, new or updated data can arrive in JSON format. This JSON data is merged into the CSV, which becomes the new CSV on the first day of the next month.

I converted the CSV to Parquet and queried it in Apache Drill, and it works fine. But how can I query the big file together with the update files?

For example, the Apr 1st CSV file has:

ID   Name  Value  LastUpdatedTime
100  John  98     2024-01-05

The Apr 15 JSON file has:

ID   Name  Value  LastUpdatedTime
100  John  100    2024-04-15

When I query all these files for ID = 100, the result should give Value=100, as that row has the newer LastUpdatedTime.

I find this post saying people use Drill on data that is no longer changing.

Is that true?

I have considered using CREATE TABLE with the CSV and then updating the table with the JSON data, but I don't see an ALTER TABLE command in the SQL Reference. Is it possible to do that?

cgivre commented 2 months ago

@kevinlo Welcome to Drill. To answer your question, yes Drill can query multiple files at once.
Let's say you have a directory of JSON files called my_files. You could write a query like the examples below which would query the JSON files in that directory.

For instance:


-- Query all JSON files in a path
SELECT *
FROM dfs.`/path/to/my/files/*.json`;

-- Query all files in a given folder
SELECT *
FROM dfs.`my_files/`;

-- Query files whose names match a pattern
SELECT *
FROM dfs.`my_files/data*.csv`;

Drill supports globs in file paths so you can use your imagination.

With respect to UPDATING data, more recent versions of Drill support INSERT queries, but only to external systems such as an RDBMS, Splunk, etc.
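For the "newest row per ID" part of the question, one possible approach is to resolve it at query time instead of updating anything: read the monthly snapshot and the JSON updates together and keep only the most recent row for each ID. This is a minimal sketch, not a tested recipe. The paths `/data/snapshot/` and `/data/updates/` are hypothetical, the column names are taken from the example above, and the casts are assumptions about how the values are stored.

```sql
-- Sketch only: the paths are hypothetical and the casts assume the example's column types.
SELECT `ID`, `Name`, `Value`, `LastUpdatedTime`
FROM (
  SELECT t.*,
         -- Rank rows per ID, newest LastUpdatedTime first
         ROW_NUMBER() OVER (PARTITION BY t.`ID`
                            ORDER BY t.`LastUpdatedTime` DESC) AS rn
  FROM (
    -- Monthly snapshot, already converted to Parquet
    SELECT CAST(`ID` AS INT)               AS `ID`,
           `Name`,
           CAST(`Value` AS INT)            AS `Value`,
           CAST(`LastUpdatedTime` AS DATE) AS `LastUpdatedTime`
    FROM dfs.`/data/snapshot/`
    UNION ALL
    -- JSON updates that arrived since the snapshot
    SELECT CAST(`ID` AS INT),
           `Name`,
           CAST(`Value` AS INT),
           CAST(`LastUpdatedTime` AS DATE)
    FROM dfs.`/data/updates/*.json`
  ) t
) ranked
WHERE rn = 1
  AND `ID` = 100;
```

With the example rows above, the JSON row dated 2024-04-15 ranks first for ID 100, so the query should return Value = 100. When the next monthly CSV arrives, the update files for the previous month would need to be moved out of the queried directory so they are not read again.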

I hope this helps.

kevinlo commented 2 months ago

@cgivre Thanks for your reply. Could you please point me to the documentation for the INSERT queries you mentioned? I checked the Supported SQL Commands page and couldn't find it.

cgivre commented 2 months ago

@kevinlo It looks like we need to update the docs. Here are some links which may help:

https://github.com/apache/drill/pull/2646

cgivre commented 2 months ago

I updated the list of commands, however, we need to add some additional documentation.