asu-cactus / netsdb

A system that seamlessly integrates Big Data processing and machine learning model serving in distributed relational database
Apache License 2.0

High performance cross product #65

Closed jiazou-bigdata closed 2 years ago

jiazou-bigdata commented 2 years ago

Added support for the RandomForest model w/ and w/o CrossProduct optimization. The CrossProduct optimization achieved more than 2x speedup.

Note that the following compile-time settings are used and may need to be tuned for different environments:

src/conf/headers/Configuration.h:

#ifndef DEFAULT_MEM_SIZE
#define DEFAULT_MEM_SIZE ((size_t)(62) * (size_t)(1024) * (size_t)(1024))
#endif

#ifndef DEFAULT_NUM_CORES
#define DEFAULT_NUM_CORES 8 
#endif

./src/builtInPDBObjects/headers/TreeResult.h:

#define MAX_BLOCK_SIZE 275000

./src/builtInPDBObjects/headers/Forest.h:

#define MAX_NUM_TREES 1600

#define MAX_NUM_NODES_PER_TREE 512
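
As a rough sanity check, these limits have to cover the workload used in the commands below: the 275,000-row block size passed on the command line and the 1,600-tree HIGGS model. A minimal, hypothetical standalone sketch (not code from the repo) that mirrors the constants above:

```cpp
// limits_check.cpp -- hypothetical sketch, not part of netsDB.
// Mirrors the #defines above and checks them against the benchmark parameters
// used below (275,000-row blocks, 1,600 trees). When changing the workload,
// retune the headers so these assertions would still hold.
#include <cstddef>

#define MAX_BLOCK_SIZE 275000        // from TreeResult.h
#define MAX_NUM_TREES 1600           // from Forest.h
#define MAX_NUM_NODES_PER_TREE 512   // from Forest.h

constexpr std::size_t kBlockRows = 275000; // block-size argument in the commands below
constexpr std::size_t kNumTrees = 1600;    // trees in higgs_randomforest_1600_8_netsdb

static_assert(kBlockRows <= MAX_BLOCK_SIZE, "increase MAX_BLOCK_SIZE in TreeResult.h");
static_assert(kNumTrees <= MAX_NUM_TREES, "increase MAX_NUM_TREES in Forest.h");

int main() { return 0; }
```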

w/o CrossProduct optimization:

./scripts/cleanupNode.sh

./scripts/startPseudoCluster.py 8 15000

bin/testDecisionForest Y 2200000 28 275000 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv

bin/testDecisionForest N 2200000 28 275000 F A 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb RandomForest

output count:2200000 positive count:1119844

1st time execution:

UDF Execution Time Duration: 112.956 secs. UDF Load Model Time Duration: 4.20368 secs.

2nd time execution:

UDF Execution Time Duration: 107.671 secs. UDF Load Model Time Duration: 0.804746 secs.

w/ CrossProduct optimization:

./scripts/cleanupNode.sh 

./scripts/startPseudoCluster.py 8 15000

bin/testDecisionForestWithCrossProduct Y 2200000 28 275000 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb RandomForest

**Note that we now need to provide one more parameter at the end: numTrees**

bin/testDecisionForestWithCrossProduct N 2200000 28 275000 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb RandomForest 1600

total count:2200000 positive count:1119844

1st execution time:

Model Inference Time Duration: 45.7068 secs.

2nd execution time:

Model Inference Time Duration: 44.1866 secs.
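
For reference, the numbers above work out to roughly 112.956 / 45.7068 ≈ 2.5x for the first execution and 107.671 / 44.1866 ≈ 2.4x for the second, which is where the "more than 2x speedup" figure comes from (note the two paths report slightly different timers: UDF Execution Time vs. Model Inference Time).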

jiazou-bigdata commented 2 years ago

@hguan6 Please take a look. Let me know if there are any problems.

hguan6 commented 2 years ago

I used an r4.2xlarge instance to run the experiments. After I ran `bin/testDecisionForest N 2200000 28 275000 F A 32 HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb RandomForest`, I got an error:

not builtin and no catalog connection, typeId for pdb::AbstractAggregateCompis 8191
not builtin and no catalog connection, typeId for pdb::Computationis 8191
Not create a set and not load new data to the input set
to search for set in Catalog: decisionForest:labels
Cannot remove set from Catalog, Set decisionForest.labels does not exist 
to broadcast StorageRemoveUserSet
2:getNextObject with msgSize=80
Failed to delete set on 1 nodes. Skipping registering with catalog
to search for set in Catalog: decisionForest:labels
Set decisionForest:labels does not exist
the internalTypeName for pdb::Vector<float> is pdb::Vector<pdb::Nothing> and typeID is 159
Create PDBCatalogSet instance of primary key being: decisionForest:labels
the internalTypeName for pdb::Vector<float> is pdb::Vector<pdb::Nothing> and typeID is 159
Create PDBCatalogSet instance of primary key being: decisionForest:labels
This is not Manager Catalog Node, thus metadata was only registered locally!
Node IP: localhost updated correctly!
******************** desired size = 1000********************
%%%%%%%%%%%%%%%%%DistributedStorageManagerServer: to add private set%%%%%%%%%%%%%%%
Page size is determined to be 67108864
No Computation and Lambda for partitioning
to broadcast StorageAddset
received StorageAddSet
%%%%%%%%Pangea to add a private set%%%%%%%%%%%
creating set in Pangea in distributed environment...with setName=labels
to add set with dbName=decisionForest, typeName=pdb::Vector<float>, setName=labels, setId=1, pageSize=67108864
type not recognized: -1
type doesn't  exist for name=pdb::Vector<float>, and we store it as default type
path to meta file is pdbRoot_localhost_8109/meta/1_decisionForest/0_UnknownUserData/1_labels
file opened:pdbRoot_localhost_8109/data/1_decisionForest/0_UnknownUserData/1_labels
meta file exists
Number of existing pages = 0
to add set with dbName=decisionForest, typeName=UnknownUserData, setName=labels, setId=1, pageSize=67108864
2:getNextObject with msgSize=52
broadcasted StorageAddSet
Created set.
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
Aborted (core dumped)

A similar error occurred when I ran `bin/testDecisionForestWithCrossProduct Y 2200000 28 275000 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb RandomForest`:

not builtin and no catalog connection, typeId for pdb::AbstractAggregateCompis 8191
not builtin and no catalog connection, typeId for pdb::Computationis 8191
2:getNextObject with msgSize=52
2:getNextObject with msgSize=52
Created database
to search for set in Catalog: decisionForest:inputs
Cannot remove set from Catalog, Set decisionForest.inputs does not exist 
to broadcast StorageRemoveUserSet
To evict all pages in set with dbName=decisionForest and setName=inputs to remove the set
We have 1 partitions
numpages in partition:0 =8
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=0
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 0, tryFlushing=0.
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=1
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 1, tryFlushing=0.
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=2
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 2, tryFlushing=0.
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=3
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 3, tryFlushing=0.
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=4
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 4, tryFlushing=0.
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=5
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 5, tryFlushing=0.
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=6
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 6, tryFlushing=0.
0: PartitionedPageIterator: curTypeId=0,curSetId=0,curPageId=7
going to unpin a clean page...
Storage server: evicting page from cache for dbId:1, typeID:0, setID=0, pageID: 7, tryFlushing=0.
2:getNextObject with msgSize=52
to deregister partition policy for inputs:decisionForest
to search for set in Catalog: decisionForest:inputs
to search for set in Catalog: decisionForest:inputs
Could not delete set, because: Error deleting set: Error failed request to node : localhost:8109. Error is :Set with the identifier decisionForest:inputs does not exist

to search for set in Catalog: decisionForest:inputs
Set decisionForest:inputs does not exist
the internalTypeName for pdb::TensorBlock2D<float> is pdb::TensorBlock2D<pdb::Nothing> and typeID is 145
Create PDBCatalogSet instance of primary key being: decisionForest:inputs
the internalTypeName for pdb::TensorBlock2D<float> is pdb::TensorBlock2D<pdb::Nothing> and typeID is 145
Create PDBCatalogSet instance of primary key being: decisionForest:inputs
This is not Manager Catalog Node, thus metadata was only registered locally!
Node IP: localhost updated correctly!
******************** desired size = 1000********************
%%%%%%%%%%%%%%%%%DistributedStorageManagerServer: to add private set%%%%%%%%%%%%%%%
Page size is determined to be 33554432
No Computation and Lambda for partitioning
to broadcast StorageAddset
received StorageAddSet
%%%%%%%%Pangea to add a private set%%%%%%%%%%%
creating set in Pangea in distributed environment...with setName=inputs
to add set with dbName=decisionForest, typeName=pdb::TensorBlock2D<float>, setName=inputs, setId=0, pageSize=33554432
Set exists with setName=inputs
Set decisionForest:inputs:pdb::TensorBlock2D<float> already exists

2:getNextObject with msgSize=128
broadcasted StorageAddSet
Not able to create set: Could not add set to distributed storage manager:BlockRowIndex:0
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
BlockRowIndex:1
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
BlockRowIndex:2
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
BlockRowIndex:3
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
Not shallow copy
To send 123201052 data
2:getNextObject with msgSize=123201044
No partition policy was found for set: inputs:decisionForest
Defaulting to random policy
Found new set: inputs:decisionForest
Dispatched 4 blocks.
mappedPartitions size = 1
received StorageAddData
BlockRowIndex:4
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
received 4 objects to store 123200988 bytes
to store data to decisionForest:inputs
data is buffered, all buffered data size=123200988
we increment numBytes to 123201052
BlockRowIndex:5
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
BlockRowIndex:6
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
BlockRowIndex:7
BlockColumnIndex:0
TotalRowNums:2200000
TotalColNums:28
Not shallow copy
To send 123200932 data
2:getNextObject with msgSize=123200924
mappedPartitions size = 1
received StorageAddData
received DistributedStorageCleanup
to wait for all requests get processed
numRequestsInProcessing: 1
received 4 objects to store 123200988 bytes
to store data to decisionForest:inputs
data is buffered, all buffered data size=246401976
we increment numBytes to 246401984
numRequestsInProcessing: 0
All data requests have been served
received StorageCleanup
to clean up for storage...
to write back records for decisionForest:inputs
246401976 bytes to write to a storage page
to allocate a page with size=33554432
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=8
pageSize = 33554396
to write 4 objects
Writing back a page!!
PageCircularBuffer:got a place.
to allocate a page with size=33554432
PageCircularBuffer: not empty, return the head element
Head pageID=8
Got a page with PageID 8 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=9
to write 4 objects
Writing back a page!!
PageCircularBuffer:got a place.
to allocate a page with size=33554432
PageCircularBuffer: not empty, return the head element
Head pageID=9
Got a page with PageID 9 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=10
to write 4 objects
Writing back a page!!
PageCircularBuffer:got a place.
to allocate a page with size=33554432
PageCircularBuffer: not empty, return the head element
Head pageID=10
Got a page with PageID 10 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=11
to write 4 objects
to write 4 objects
Writing back a page!!
PageCircularBuffer:got a place.
to allocate a page with size=33554432
PageCircularBuffer: not empty, return the head element
Head pageID=11
Got a page with PageID 11 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=12
to write 4 objects
Writing back a page!!
PageCircularBuffer:got a place.
to allocate a page with size=33554432
PageCircularBuffer: not empty, return the head element
Head pageID=12
Got a page with PageID 12 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=13
to write 4 objects
Writing back a page!!
PageCircularBuffer:got a place.
to allocate a page with size=33554432
PageCircularBuffer: not empty, return the head element
Head pageID=13
Got a page with PageID 13 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=14
to write 4 objects
Writing back a page!!
PageCircularBuffer:got a place.
to allocate a page with size=33554432
PageCircularBuffer: not empty, return the head element
Head pageID=14
Got a page with PageID 14 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
PageCache: getNewPage: Page created for typeId=0,setId=0,pageId=15
to write 4 objects
Write all of the bytes in the record.
to flush without eviction
PageCircularBuffer:got a place.
Now all the records are back.
Now there are 16 new objects stored in storage
PageCircularBuffer: not empty, return the head element
Head pageID=15
Got a page with PageID 15 for partition:0
page dbId=1
page typeId=0
page setId=0
PDBFlushConsumerWork: page freed from cache
PageCircularBuffer: array is empty.
2:getNextObject with msgSize=52
to search for set in Catalog: decisionForest:trees
Cannot remove set from Catalog, Set decisionForest.trees does not exist 
to broadcast StorageRemoveUserSet
To evict all pages in set with dbName=decisionForest and setName=trees to remove the set
We have 1 partitions
2:getNextObject with msgSize=52
to deregister partition policy for trees:decisionForest
to search for set in Catalog: decisionForest:trees
to search for set in Catalog: decisionForest:trees
Could not delete set, because: Error deleting set: Error failed request to node : localhost:8109. Error is :Set with the identifier decisionForest:trees does not exist

to search for set in Catalog: decisionForest:trees
Set decisionForest:trees does not exist
the internalTypeName for pdb::Tree is pdb::Tree and typeID is 151
Create PDBCatalogSet instance of primary key being: decisionForest:trees
the internalTypeName for pdb::Tree is pdb::Tree and typeID is 151
Create PDBCatalogSet instance of primary key being: decisionForest:trees
This is not Manager Catalog Node, thus metadata was only registered locally!
Node IP: localhost updated correctly!
******************** desired size = 1000********************
%%%%%%%%%%%%%%%%%DistributedStorageManagerServer: to add private set%%%%%%%%%%%%%%%
Page size is determined to be 8388608
No Computation and Lambda for partitioning
to broadcast StorageAddset
received StorageAddSet
%%%%%%%%Pangea to add a private set%%%%%%%%%%%
creating set in Pangea in distributed environment...with setName=trees
to add set with dbName=decisionForest, typeName=pdb::Tree, setName=trees, setId=1, pageSize=8388608
Set exists with setName=trees
Set decisionForest:trees:pdb::Tree already exists

2:getNextObject with msgSize=112
broadcasted StorageAddSet
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
Aborted (core dumped)

P.S. It worked perfectly fine for XGBoost. I also uploaded the converted netsDB model to our S3 bucket.

jiazou-bigdata commented 2 years ago

Hi Hong, I will investigate the error this evening after teaching and meetings. It seems the program runs into errors when parsing the model. Have you tried using the model that I uploaded to S3 and converting it to netsDB?
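
For context on the abort: std::stoi throws std::invalid_argument exactly when it is handed a non-numeric token, which matches the `what(): stoi` line in the log above. A minimal, hypothetical debugging wrapper (not code from the repo) that surfaces the offending token while parsing a model file would look like this:

```cpp
// stoi_debug.cpp -- hypothetical sketch, not netsDB code.
// std::stoi throws std::invalid_argument on a non-numeric token, which is the
// "what(): stoi" abort shown above. Logging the token before rethrowing points
// directly at the model-file line whose format the parser did not expect.
#include <iostream>
#include <stdexcept>
#include <string>

int parseIntToken(const std::string& token, const std::string& context) {
    try {
        return std::stoi(token);
    } catch (const std::invalid_argument&) {
        std::cerr << "non-numeric token '" << token << "' while parsing " << context << std::endl;
        throw; // rethrow after logging so the failure is still visible to the caller
    }
}

int main() {
    std::cout << parseIntToken("42", "a tree node line") << std::endl; // parses fine
    try {
        parseIntToken("leaf", "a tree node line");                     // logs the bad token, then throws
    } catch (const std::invalid_argument&) {
        std::cerr << "parse failed, see message above" << std::endl;
    }
    return 0;
}
```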

hguan6 commented 2 years ago

> Hi Hong, I will investigate the error this evening after teaching and meetings. It seems the program runs into errors when parsing the model. Have you tried using the model that I uploaded to S3 and converting it to netsDB?

Yes, I converted it from the .pkl file in S3.

jiazou-bigdata commented 2 years ago

@hguan6 I fixed the problem. Please check again.

Note that for the second command, we now need to add a parameter at the end to specify the number of trees.

bin/testDecisionForestWithCrossProduct N 2200000 28 275000 32 model-inference/decisionTree/experiments/HIGGS.csv_test.csv model-inference/decisionTree/experiments/models1/higgs_randomforest_1600_8_netsdb RandomForest 1600

hguan6 commented 2 years ago

@jiazou-bigdata I still got the same error message. Do I need to do something else besides pulling the latest code?

jiazou-bigdata commented 2 years ago

> @jiazou-bigdata I still got the same error message. Do I need to do something else besides pulling the latest code?

The model files downloaded from S3 run well on my side. Which branch are you on? Did you recompile the code by running `scons libDFTest`?

hguan6 commented 2 years ago

> The model files downloaded from S3 run well on my side. Which branch are you on? Did you recompile the code by running `scons libDFTest`?

I didn't recompile the code. Let me do that now.

hguan6 commented 2 years ago

@jiazou-bigdata I recompiled the code. The code with the CrossProduct optimization runs well, and the result I got is consistent with yours: **Model Inference Time Duration: 44.7118 secs.**

But I got some other errors when I ran the code without the CrossProduct optimization. The error message for `bin/testDecisionForest N 2200000 28 275000 F A 32 HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb RandomForest`:

not builtin and no catalog connection, typeId for pdb::AbstractAggregateCompis 8191
not builtin and no catalog connection, typeId for pdb::Computationis 8191
Not create a set and not load new data to the input set
to search for set in Catalog: decisionForest:labels
Cannot remove set from Catalog, Set decisionForest.labels does not exist 
to broadcast StorageRemoveUserSet
2:getNextObject with msgSize=80
Failed to delete set on 1 nodes. Skipping registering with catalog
to search for set in Catalog: decisionForest:labels
Set decisionForest:labels does not exist
the internalTypeName for pdb::Vector<float> is pdb::Vector<pdb::Nothing> and typeID is 159
Create PDBCatalogSet instance of primary key being: decisionForest:labels
the internalTypeName for pdb::Vector<float> is pdb::Vector<pdb::Nothing> and typeID is 159
Create PDBCatalogSet instance of primary key being: decisionForest:labels
This is not Manager Catalog Node, thus metadata was only registered locally!
Node IP: localhost updated correctly!
******************** desired size = 1000********************
%%%%%%%%%%%%%%%%%DistributedStorageManagerServer: to add private set%%%%%%%%%%%%%%%
Page size is determined to be 67108864
No Computation and Lambda for partitioning
to broadcast StorageAddset
received StorageAddSet
%%%%%%%%Pangea to add a private set%%%%%%%%%%%
creating set in Pangea in distributed environment...with setName=labels
to add set with dbName=decisionForest, typeName=pdb::Vector<float>, setName=labels, setId=1, pageSize=67108864
type not recognized: -1
type doesn't  exist for name=pdb::Vector<float>, and we store it as default type
path to meta file is pdbRoot_localhost_8109/meta/1_decisionForest/0_UnknownUserData/1_labels
file opened:pdbRoot_localhost_8109/data/1_decisionForest/0_UnknownUserData/1_labels
meta file exists
Number of existing pages = 0
to add set with dbName=decisionForest, typeName=UnknownUserData, setName=labels, setId=1, pageSize=67108864
2:getNextObject with msgSize=52
broadcasted StorageAddSet
Created set.
model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb/951.txt
process inner nodes
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
Aborted (core dumped)

jiazou-bigdata commented 2 years ago

> @jiazou-bigdata I recompiled the code. The code with the CrossProduct optimization runs well, and the result I got is consistent with yours: **Model Inference Time Duration: 44.7118 secs.**
>
> But I got some other errors when I ran the code without the CrossProduct optimization. The error message for `bin/testDecisionForest N 2200000 28 275000 F A 32 HIGGS.csv_test.csv model-inference/decisionTree/experiments/models/higgs_randomforest_1600_8_netsdb RandomForest`:

Thanks Hong for catching this. I forgot to commit this change: https://github.com/asu-cactus/netsdb/pull/65/commits/4596fc77b03b1fd838eabaaa1bd50275a6459344

Now it should work. Please try again!

BTW, I was thinking about better integrating Forest.h and Tree.h so that Forest.h reuses the constructTreeFromPath function from Tree.h; that way, we would no longer need to change void constructForestFromPaths(std::vector & treePathIn, ModelType modelType, bool isClassification) {...} for different model versions. However, I did not have enough time to do so.

@hguan6, can you take a look at this and help merge these two?

https://github.com/asu-cactus/netsdb/blob/4596fc77b03b1fd838eabaaa1bd50275a6459344/src/builtInPDBObjects/headers/Tree.h#L304

https://github.com/asu-cactus/netsdb/blob/master/src/builtInPDBObjects/headers/Forest.h#L254
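
A rough sketch of that integration, with assumed types and signatures (the element type of treePathIn and the Tree API shown here are placeholders, not the actual netsDB headers):

```cpp
// forest_tree_sketch.cpp -- hypothetical sketch of the refactoring idea above,
// not the actual netsDB headers; types and signatures are assumptions.
#include <string>
#include <vector>

enum class ModelType { RandomForest, XGBoost };

struct Tree {
    // Assumed to own all per-model-version parsing (cf. Tree.h line 304);
    // the real parsing logic is elided -- this is only a structural sketch.
    static Tree constructTreeFromPath(const std::string& /*treePath*/,
                                      ModelType /*modelType*/,
                                      bool /*isClassification*/) {
        return Tree{};
    }
};

struct Forest {
    std::vector<Tree> trees;

    // Forest no longer parses model files itself; it delegates to Tree, so
    // only Tree::constructTreeFromPath needs to change across model versions.
    void constructForestFromPaths(const std::vector<std::string>& treePathIn,
                                  ModelType modelType,
                                  bool isClassification) {
        trees.clear();
        trees.reserve(treePathIn.size());
        for (const auto& path : treePathIn) {
            trees.push_back(Tree::constructTreeFromPath(path, modelType, isClassification));
        }
    }
};

int main() { return 0; }
```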

hguan6 commented 2 years ago

> Now it should work. Please try again!

Yes, I will try it now.

> @hguan6, can you take a look at this and help merge these two?

Yes, once I finish the TFDF experiments, I will take a look.

hguan6 commented 2 years ago

> Now it should work. Please try again!

It works now. Here are the results:

First run: UDF Execution Time Duration: 99.4244 secs. UDF Load Model Time Duration: 5.59455 secs.

Second run: UDF Execution Time Duration: 101.907 secs. UDF Load Model Time Duration: 0.869169 secs.

It is consistent with your results. I think it is ready to be merged.