awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Tutorial dataset amazon-reviews-pds no longer works #150

Open danipilze opened 10 months ago

danipilze commented 10 months ago

Describe the bug The tutorial dataset amazon-reviews-pds is no longer available. According to this Reddit thread, it has been removed: https://www.reddit.com/r/dataengineering/comments/15ohj6q/trouble_accessing_the_amazon_reviews_dataset_in/

To Reproduce Steps to reproduce the behavior:

  1. Go to 'tutorials/profiles.ipynb'
  2. Click on 'run'
  3. Scroll down to 'results'
  4. See error
Error while looking for metadata directory in the path: s3a://amazon-reviews-pds/parquet/product_category=Electronics/.
java.nio.file.AccessDeniedException: s3a://amazon-reviews-pds/parquet/product_category=Electronics: getFileStatus on s3a://amazon-reviews-pds/parquet/product_category=Electronics: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YKPCT0JTXNYM9CGF; S3 Extended Request ID: g4wN6cQAjmgk4pHvpVOeEJ1ef21H6PTfJiDnVFNP0agHcoFBxTI11hyOmzHWMzqmz/G0YinuOZ4=; Proxy: null), S3 Extended Request ID: g4wN6cQAjmgk4pHvpVOeEJ1ef21H6PTfJiDnVFNP0agHcoFBxTI11hyOmzHWMzqmz/G0YinuOZ4=:AccessDenied
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:249)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3348)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:562)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YKPCT0JTXNYM9CGF; S3 Extended Request ID: g4wN6cQAjmgk4pHvpVOeEJ1ef21H6PTfJiDnVFNP0agHcoFBxTI11hyOmzHWMzqmz/G0YinuOZ4=; Proxy: null), S3 Extended Request ID: g4wN6cQAjmgk4pHvpVOeEJ1ef21H6PTfJiDnVFNP0agHcoFBxTI11hyOmzHWMzqmz/G0YinuOZ4=
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1828)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1412)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1374)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5227)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5173)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5167)
    at com.amazonaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:963)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$7(S3AFileSystem.java:2116)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:412)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:375)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:2107)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3322)
    ... 21 more

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[4], line 1
----> 1 df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")
      3 df.printSchema()

Expected behavior Read the dataframe without issues

Environment: Amazon SageMaker Notebook instances Jupyter notebook

komashk commented 2 months ago

Hi @danipilze, the Amazon reviews dataset has indeed been removed and will no longer be available. We have generated a synthetic reviews dataset and are in the process of updating PyDeequ blogs and tutorials with it.

Please use the following S3 path to access the data and follow the tutorial: s3a://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Electronics/
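
For anyone updating the notebook cell from the traceback above, here is a minimal sketch of the swap. It assumes the synthetic data is stored as Parquet, like the original dataset, and that a `spark` session already exists as in the tutorial. `synthetic_reviews_path` and `load_reviews` are hypothetical helper names for illustration, not part of PyDeequ, and only the Electronics partition is confirmed in this thread.

```python
# Bucket and prefix taken from the replacement path given in this thread.
SYNTHETIC_REVIEWS_ROOT = "s3a://aws-bigdata-blog/generated_synthetic_reviews/data"


def synthetic_reviews_path(product_category="Electronics"):
    """Build the partition path for one product category.

    Hypothetical helper; only the Electronics partition is confirmed above.
    """
    return f"{SYNTHETIC_REVIEWS_ROOT}/product_category={product_category}/"


def load_reviews(spark, product_category="Electronics"):
    """Read one partition with an existing SparkSession.

    Assumes the data is Parquet, as the original amazon-reviews-pds data was.
    """
    return spark.read.parquet(synthetic_reviews_path(product_category))
```

In the tutorial notebook, the failing cell would then become `df = load_reviews(spark)` followed by `df.printSchema()`.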

danipilze commented 2 months ago

Thanks @komashk, that's really helpful for the classes I teach on profiling with Deequ!