JC-Shi / Learned-Index-Benefits


The issue about the data and feature generation process #9

Open sylph520 opened 2 weeks ago

sylph520 commented 2 weeks ago

Since there is no corresponding data for the SQL files and no code for generating the features of the training and test data, I found it difficult to interpret the csv files you provided. For example, in my interpretation, each csv file contains a list of items, and each item consists of a list of column encodings for the operators in a query together with the reduction ratio label. However, when I checked the first item in data/TPC_DS_10_by_query.csv, all columns seem to be in multi-column indexes, yet the number of columns with ordering 1 is 2 while the number of columns with ordering 1 is 7. Shouldn't these two numbers be equal?
For better reproducibility, would you be willing to share the essential code for generating the features and training data, or at least a clearer illustration of how some example queries correspond to the data you provided? Otherwise, it would be difficult for us to test your method on other workloads. Thanks.

JC-Shi commented 1 week ago

Hello, thanks for the feedback. Because the feature extraction script relies closely on the query plan representation and the naming conventions used in each database system, it is challenging to design a one-size-fits-all feature extraction script for all forms of query plans and all database systems. Regarding the index information encoding, we first use one-hot encoding to represent whether an index is a multi-column index or a single-column index. We then include the order of the column in the index as the third feature. Because an index can affect different operations on different columns, it may be encoded multiple times during feature extraction. Furthermore, two indexes (both multi-column) may contain the same column. In this case, we encode the index information for the index-optimisable operations using the multi-column index that has the lowest order. The provided csv files were generated using the methods discussed in the original paper. If you have any further questions, please feel free to discuss them with us. Once again, thank you for the feedback.
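
For readers trying to map the description above onto the csv columns, here is a minimal sketch of one way the three index features could be computed. It is not the repository's extraction script. The function name `encode_column`, the tuple representation of an index, the `[is_multi, is_single, order]` layout, the 1-based order, and the handling of uncovered columns are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: an index is the ordered tuple of its column names.
Index = Tuple[str, ...]


def encode_column(column: str, indexes: List[Index]) -> Optional[List[int]]:
    """Return [is_multi, is_single, order] for `column`, or None if no index covers it.

    Assumptions (not confirmed by the repository):
      * the one-hot pair is laid out as [multi-column, single-column];
      * the order is the 1-based position of the column inside the index;
      * when several indexes contain the column, the one in which the column
        has the lowest order is used, and ties go to the first index listed.
    """
    candidates = []
    for idx in indexes:
        if column in idx:
            order = idx.index(column) + 1  # 1-based position within the index
            candidates.append((len(idx) > 1, order))

    if not candidates:
        return None

    # Keep the candidate with the lowest order; min() returns the first on ties.
    is_multi, order = min(candidates, key=lambda c: c[1])
    return [int(is_multi), int(not is_multi), order]


if __name__ == "__main__":
    example_indexes = [("a", "b", "c"), ("b", "d"), ("e",)]
    for col in ("b", "c", "e", "z"):
        print(col, encode_column(col, example_indexes))
```

Under these assumptions, column `b` appears in two multi-column indexes and is encoded from `("b", "d")`, where its order is lowest. Because the same index can be encoded once per operation it affects, per-order counts taken over a whole query need not match across orders.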