Closed juliodelgadoaws closed 5 years ago
:exclamation: No coverage uploaded for pull request base (master@d5986da). Click here to learn what that means. The diff coverage is 100%.
@@ Coverage Diff @@
## master #68 +/- ##
=========================================
Coverage ? 92.29%
=========================================
Files ? 52
Lines ? 3140
Branches ? 86
=========================================
Hits ? 2898
Misses ? 242
Partials ? 0
Impacted Files | Coverage Δ |
---|---|
...agemaker/sparksdk/protobuf/ProtobufConverter.scala | 100% <100%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
ø = not affected
? = missing data
Powered by Codecov. Last update d5986da...9fe28da. Read the comment docs.
UPDATE: The previous dense example was also incorrect. Please refer to the examples in the later posts.
Awesome! The example in https://github.com/aws/sagemaker-spark/issues/63 worked out. However, there seems to be an inconsistency between sparse and dense matrices?
Continuing on the original example:
# Needs: import glob; import numpy as np
# import pyspark.sql.functions as F
# from pyspark.ml.linalg import Matrices, MatrixUDT
# (df and read_records come from the original example in issue #63)

# this is correct
sm = df.select(F.udf(lambda *args: Matrices.sparse(9, 2, [0, 2, 4], np.ravel(args), np.ravel(args)).toDense(),
                     MatrixUDT())
               ('c1', 'c2')
               .alias('features')
               )
sm.show()
sm.write.format('sagemaker').save('spark.io', mode='overwrite')
with open(glob.glob('spark.io/part-*')[0], 'rb') as f:
    print(read_records(f))
# this is incorrect?
sm = df.select(F.udf(lambda *args: Matrices.sparse(9, 2, [0, 2, 4], np.ravel(args), np.ravel(args)),
                     MatrixUDT())
               ('c1', 'c2')
               .alias('features')
               )
sm.show()
sm.write.format('sagemaker').save('spark.io', mode='overwrite')
with open(glob.glob('spark.io/part-*')[0], 'rb') as f:
    print(read_records(f))
I've checked the results, and they match the sparse-matrix encoding:
See https://github.com/aws/sagemaker-spark/pull/68/files#diff-4a25a60c7160028e6f8a4c76223cd342R226
The Sparse matrix has to be “encoded” into the Record. The schema is:
https://github.com/aws/sagemaker-spark/pull/68/files#diff-4a25a60c7160028e6f8a4c76223cd342R220
@yifeim provided the following example. I'll change the encoding format:
Let’s take an example. Say the matrix is:
[0, 3, 4, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 7, 8]
It has shape (2, 7). The sparse vector representation (one record per row) is:
[features {
key: "values"
value {
float32_tensor {
values: 3.0
values: 4.0
keys: 1
keys: 2
shape: 7
}
}
}
, features {
key: "values"
value {
float32_tensor {
values: 7.0
values: 8.0
keys: 5
keys: 6
shape: 7
}
}
}
]
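The per-row encoding above can be sketched in plain NumPy. This helper is hypothetical (not part of sagemaker-spark): each row becomes a float32_tensor whose keys are the nonzero column indices and whose values are the corresponding entries.

```python
import numpy as np

# Hypothetical helper (not library code): encode one matrix row as the
# (keys, values) pair of a sparse float32_tensor, as in the records above.
def row_to_sparse(row):
    row = np.asarray(row, dtype=np.float32)
    keys = np.flatnonzero(row)            # column indices of nonzero entries
    return keys.tolist(), row[keys].tolist()

# First row of the example -> keys [1, 2], values [3.0, 4.0]
print(row_to_sparse([0, 3, 4, 0, 0, 0, 0]))
```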
If we make it a sparse matrix, the representation should be:
[features {
key: "values"
value {
float32_tensor {
values: 3.0
values: 4.0
values: 7.0
values: 8.0
keys: 1
keys: 2
keys: 12 # floor(12/7) gives row 1, and 12%7 gives column 5
keys: 13 # floor(13/7) gives row 1, and 13%7 gives column 6
shape: 2 # number of rows
shape: 7 # number of columns
}
}
}
]
The format also extends to higher-dimensional objects (if supported by Spark).
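The linearized-key scheme described above can be sketched as follows (the helper names are mine, not from the library): a nonzero at (row, col) in a matrix of shape (rows, cols) gets key row * cols + col, and decoding is just divmod.

```python
def linearize(row, col, num_cols):
    # Key for entry (row, col) in the sparse-matrix Record encoding above.
    return row * num_cols + col

def delinearize(key, num_cols):
    # Inverse: floor(key / num_cols) is the row, key % num_cols the column.
    return divmod(key, num_cols)

# For the (2, 7) example: entry (1, 5) -> key 12, entry (1, 6) -> key 13
print(linearize(1, 5, 7), linearize(1, 6, 7))   # 12 13
print(delinearize(12, 7), delinearize(13, 7))   # (1, 5) (1, 6)
```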
I'll be sending a new revision today in which SparseMatrix is encoded in a different way. This is the way originally intended in the Record-protobuf format.
Thanks! I verified that I can reproduce the example with the following code:
# Needs: import glob; import numpy as np
# import pyspark.sql.functions as F
# from pyspark.ml.linalg import Matrices, MatrixUDT
df = spark.createDataFrame([[[
    [0, 3, 4, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 7, 8],
]]], ['X'])
df.show()
# Spark's DenseMatrix is column-major, hence the transpose before ravel
sm = df.select(F.udf(lambda X: Matrices.dense(2, 7, np.ravel(np.transpose(X))).toSparse(),
                     MatrixUDT())('X')
               .alias('features'))
sm.show()
sm.repartition(1).write.format('sagemaker').save('spark.io', mode='overwrite')
with open(glob.glob('spark.io/part-*')[0], 'rb') as f:
    print(read_records(f))
[features {
key: "values"
value {
float32_tensor {
values: 3.0
values: 4.0
values: 7.0
values: 8.0
keys: 1
keys: 2
keys: 12
keys: 13
shape: 2
shape: 7
}
}
}
]
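As a sanity check on the record above, here's a sketch (hypothetical helper, not library code) that rebuilds the dense matrix from the record's values, keys, and shape fields:

```python
import numpy as np

# Hypothetical decoder for the sparse-matrix Record encoding shown above.
def decode_sparse_record(values, keys, shape):
    rows, cols = shape
    dense = np.zeros((rows, cols), dtype=np.float32)
    for v, k in zip(values, keys):
        r, c = divmod(k, cols)        # row = floor(k / cols), col = k % cols
        dense[r, c] = v
    return dense

m = decode_sparse_record([3.0, 4.0, 7.0, 8.0], [1, 2, 12, 13], (2, 7))
print(m)  # recovers rows [0,3,4,0,0,0,0] and [0,0,0,0,0,7,8]
```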
The errors were due to the coverage report being out of date; that should be resolved now. Resolving conflicts and running again.
Encode DenseMatrix and SparseMatrix in protobuf Record format. Transposed matrices are not supported.
*Issue #, if available: 63
*Description of changes: Encode DenseMatrix and SparseMatrix in protobuf Record format. Transposed matrices are not supported.
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.