The HeteroDataBuilder currently does the following:
- loads each table (in full) using `pd.read_sql`
- computes the `edge_index` for each relation using Pandas on top of the loaded tables
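To make the two steps above concrete, here is a minimal sketch of loading two tables in full and deriving one relation's `edge_index` with Pandas. It is not the actual HeteroDataBuilder code: the toy schema, the in-memory SQLite database, and the helper name `edge_index_for_relation` are all assumptions made for illustration.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical toy database: orders.customer_id references customer.id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (10, 'a'), (20, 'b');
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10), (4, 99);
    """
)

# Step 1: load each table in full.
customer = pd.read_sql("SELECT * FROM customer ORDER BY id", conn)
orders = pd.read_sql("SELECT * FROM orders ORDER BY id", conn)


def edge_index_for_relation(src_df, src_fk, dst_df, dst_pk):
    """Return a (2, num_edges) array of row positions (hypothetical helper)."""
    # Attach each row's position so the join yields indices, not key values.
    src = pd.DataFrame({"key": src_df[src_fk].to_numpy(),
                        "src": np.arange(len(src_df))})
    dst = pd.DataFrame({"key": dst_df[dst_pk].to_numpy(),
                        "dst": np.arange(len(dst_df))})
    # The inner join drops foreign keys that match no destination row.
    pairs = src.dropna(subset=["key"]).merge(dst, on="key", how="inner")
    pairs = pairs.sort_values(["src", "dst"])
    return pairs[["src", "dst"]].to_numpy().T


# Step 2: compute the edge_index for the orders -> customer relation.
edge_index = edge_index_for_relation(orders, "customer_id", customer, "id")
print(edge_index)
```

The order with `customer_id = 99` has no matching customer, so it produces no edge. The resulting array could be wrapped with `torch.from_numpy` before being stored on a graph object.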
This approach is really fast. Since Pandas also supports `pd.read_sql_query` for any SQL query built using SQLAlchemy, I propose rewriting BFSStrategy on top of Pandas as well. I expect the benefits to be speed (hopefully), cleaner code, and results that are more consistent with HeteroDataBuilder, since the new type converters already use Pandas, also for speed reasons.
I think the new BFSStrategy could work as follows:
- load the target table (or a batch from the target table) using a single call to `pd.read_sql`
- load the joined tables the same way, with `pd.read_sql` calls inside the BFS
- compute the `edge_index` at the end, similarly to how I do it in HeteroDataBuilder (hopefully)
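The steps above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the proposed implementation: the `RELATIONS` list, the helpers `fetch_by_keys` and `bfs_load`, the toy schema, and the convention that every table has an integer `id` column are all hypothetical. It uses plain SQL strings over SQLite to stay self-contained; a query built with SQLAlchemy would take the place of the string passed to `pd.read_sql_query`.

```python
import sqlite3
from collections import deque

import pandas as pd

# Hypothetical toy schema; every table is assumed to have an integer "id" key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE item (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO customer VALUES (10, 'a'), (20, 'b');
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10);
    INSERT INTO item VALUES (100, 1), (101, 3), (102, 2);
    """
)

# Hypothetical relation list: (child_table, fk_column, parent_table, pk_column).
RELATIONS = [
    ("orders", "customer_id", "customer", "id"),
    ("item", "order_id", "orders", "id"),
]


def fetch_by_keys(conn, table, column, keys):
    """Load the rows of `table` whose `column` is in `keys` with one query."""
    # Table and column names come from the trusted schema description above.
    placeholders = ",".join("?" * len(keys))
    sql = f"SELECT * FROM {table} WHERE {column} IN ({placeholders}) ORDER BY id"
    return pd.read_sql_query(sql, conn, params=list(keys))


def bfs_load(conn, target_table, target_ids, max_depth=2):
    """Breadth-first load of the rows reachable from the target batch."""
    start = fetch_by_keys(conn, target_table, "id", target_ids)
    loaded = {target_table: start}
    queue = deque([(target_table, start, 0)])
    while queue:
        table, frame, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for child, fk, parent, pk in RELATIONS:
            if child == table:      # follow the foreign key to the parent
                nxt, col, keys = parent, pk, frame[fk]
            elif parent == table:   # follow the relation back to the children
                nxt, col, keys = child, fk, frame[pk]
            else:
                continue
            # tolist() yields plain Python ints, which sqlite3 can bind.
            keys = keys.dropna().unique().tolist()
            if not keys:
                continue
            new = fetch_by_keys(conn, nxt, col, keys)
            if nxt in loaded:
                # Keep only rows that were not loaded in an earlier step.
                new = new[~new["id"].isin(loaded[nxt]["id"])]
                if new.empty:
                    continue
                loaded[nxt] = pd.concat([loaded[nxt], new], ignore_index=True)
            elif new.empty:
                continue
            else:
                loaded[nxt] = new
            queue.append((nxt, new, depth + 1))
    return loaded


loaded = bfs_load(conn, "customer", [10], max_depth=2)
for name, frame in loaded.items():
    print(name, frame["id"].tolist())
```

Starting from customer 10 with a depth limit of 2, this loads that customer's orders and the items of those orders. The `edge_index` for each relation could then be computed from the frames in `loaded` once the traversal has finished.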
Finally, we should probably merge HeteroDataBuilder with Dataset and find a clean way to expose the two approaches as different strategies for the dataset ("full strategy" vs "bfs strategy").