aws / sagemaker-spark

A Spark library for Amazon SageMaker.
https://aws.github.io/sagemaker-spark/
Apache License 2.0

Add support for Matrix in RecordIO-Protobuf (#63) #68

Closed juliodelgadoaws closed 5 years ago

juliodelgadoaws commented 6 years ago

Encode DenseMatrix and SparseMatrix in the protobuf Record format. Transposed matrices are not supported.

*Issue #, if available: 63

*Description of changes: Encode DenseMatrix and SparseMatrix in the protobuf Record format. Transposed matrices are not supported.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov-io commented 6 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@d5986da). Click here to learn what that means. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master      #68   +/-   ##
=========================================
  Coverage          ?   92.29%           
=========================================
  Files             ?       52           
  Lines             ?     3140           
  Branches          ?       86           
=========================================
  Hits              ?     2898           
  Misses            ?      242           
  Partials          ?        0
Impacted Files Coverage Δ
...agemaker/sparksdk/protobuf/ProtobufConverter.scala 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d5986da...9fe28da. Read the comment docs.

yifeim commented 6 years ago

UPDATE: The previous dense example was also incorrect. Please refer to the examples in the later posts.

Awesome! The example in https://github.com/aws/sagemaker-spark/issues/63 worked out. However, there seems to be an inconsistency between sparse and dense matrices?

Continuing on the original example:

# this is correct
sm = df.select(F.udf(lambda *args: Matrices.sparse(9, 2, [0,2,4], np.ravel(args), np.ravel(args)).toDense(),
                     MatrixUDT())
                    ('c1', 'c2')
              .alias('features')
             )
sm.show()

sm.write.format('sagemaker').save('spark.io', mode='overwrite')

with open(glob.glob('spark.io/part-*')[0], 'rb') as f:
    print(read_records(f))

# this is incorrect?
sm = df.select(F.udf(lambda *args: Matrices.sparse(9, 2, [0,2,4], np.ravel(args), np.ravel(args)),
                     MatrixUDT())
                    ('c1', 'c2')
              .alias('features')
             )
sm.show()

sm.write.format('sagemaker').save('spark.io', mode='overwrite')

with open(glob.glob('spark.io/part-*')[0], 'rb') as f:
    print(read_records(f))
juliodelgadoaws commented 6 years ago

I've checked the results and they match the encoding of sparse matrices:

See https://github.com/aws/sagemaker-spark/pull/68/files#diff-4a25a60c7160028e6f8a4c76223cd342R226

The Sparse matrix has to be “encoded” into the Record. The schema is:

https://github.com/aws/sagemaker-spark/pull/68/files#diff-4a25a60c7160028e6f8a4c76223cd342R220

juliodelgadoaws commented 6 years ago

@yifeim provided the following example. I'll change the encoding format:

Let’s take an example. Say the matrix is [[0,3,4,0,0,0,0], [0,0,0,0,0,7,8]].

It has shape (2,7). The sparse vector representation is:

[features {
  key: "values"
  value {
    float32_tensor {
      values: 3.0
      values: 4.0
      keys: 1
      keys: 2
      shape: 7
    }
  }
}
, features {
  key: "values"
  value {
    float32_tensor {
      values: 7.0
      values: 8.0
      keys: 5
      keys: 6
      shape: 7
    }
  }
}
]

If we make it a sparse matrix, the representation should be:

[features {
  key: "values"
  value {
    float32_tensor {
      values: 3.0
      values: 4.0
      values: 7.0
      values: 8.0
      keys: 1
      keys: 2
      keys: 12 # floor(12/7) gives row 1, and 12%7 gives column 5
      keys: 13 # floor(13/7) gives row 1, and 13%7 gives column 6
      shape: 2 # number of rows
      shape: 7 # number of columns
    }
  }
}
]

The format also extends to higher-dimensional objects (if supported by Spark).
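The flattening scheme described above can be sketched with a small helper (the function name `flatten_keys` is illustrative, not part of the library; the scheme itself, key = row * n_cols + col, is the row-major layout this comment describes):

```python
def flatten_keys(rows, cols, n_cols):
    # Row-major flattening: each (row, col) nonzero position
    # maps to a single key = row * n_cols + col.
    return [r * n_cols + c for r, c in zip(rows, cols)]

# Matrix [[0,3,4,0,0,0,0],
#         [0,0,0,0,0,7,8]] has nonzeros at
# (0,1)=3, (0,2)=4, (1,5)=7, (1,6)=8
keys = flatten_keys([0, 0, 1, 1], [1, 2, 5, 6], n_cols=7)
# keys == [1, 2, 12, 13], matching the record above
```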

juliodelgadoaws commented 6 years ago

I'll be sending a new revision today in which SparseMatrix is encoded differently. This is the encoding originally intended by the Record-protobuf format.

yifeim commented 6 years ago

Thanks! I verified that I can reproduce the example with the following code:

df = spark.createDataFrame([[[
    [0,3,4,0,0,0,0],
    [0,0,0,0,0,7,8],
]]], ['X'])
df.show()

sm = df.select(F.udf(lambda X: Matrices.dense(2,7,np.ravel(np.transpose(X))).toSparse(),
                MatrixUDT())('X')
          .alias('features'))
sm.show()

sm.repartition(1).write.format('sagemaker').save('spark.io', mode='overwrite')
with open(glob.glob('spark.io/part-*')[0], 'rb') as f:
    print(read_records(f))
[features {
  key: "values"
  value {
    float32_tensor {
      values: 3.0
      values: 4.0
      values: 7.0
      values: 8.0
      keys: 1
      keys: 2
      keys: 12
      keys: 13
      shape: 2
      shape: 7
    }
  }
}
]
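For completeness, the flattened keys in the record above can be mapped back to (row, column) pairs by inverting the row-major scheme (the helper name `unflatten_key` is illustrative):

```python
def unflatten_key(key, n_cols):
    # Inverse of row-major flattening:
    # row = key // n_cols, col = key % n_cols
    return divmod(key, n_cols)

# Keys from the record above, for a (2, 7) matrix
pairs = [unflatten_key(k, 7) for k in [1, 2, 12, 13]]
# pairs == [(0, 1), (0, 2), (1, 5), (1, 6)]
```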
ChoiByungWook commented 6 years ago

Errors were due to the coverage report being out of date. That should be resolved. Resolving conflicts and running again.