cognitect-labs / aws-api

AWS, data driven
Apache License 2.0

XML parse issue with s3 :SelectObjectContent #132

Open jimmyhmiller opened 4 years ago

jimmyhmiller commented 4 years ago

Dependencies

{:deps {com.cognitect.aws/api {:mvn/version "0.8.445"}
        com.cognitect.aws/endpoints {:mvn/version "1.1.11.732"}
        com.cognitect.aws/s3 {:mvn/version "784.2.593.0"}}}

Description with failing test case

When running the S3 SelectObjectContent operation, the response body is not plain XML, so parsing the body fails. You can reproduce this by running:


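(require '[cognitect.aws.client.api :as aws])

;; Client and bucket used by both calls below.
(def s3 (aws/client {:api :s3}))
(def bucket "my-bucket")

(aws/invoke s3 {:op :PutObject
                :request {:Bucket bucket
                          :Key "test.csv"
                          :Body "col1,col2\n1,2\n3,4"}})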

(aws/invoke s3
            {:op :SelectObjectContent
             :request
             {:Bucket bucket
              :Key "test.csv"
              :Expression "select * from S3Object s"
              :ExpressionType "SQL"
              :InputSerialization {:CSV {:RecordDelimiter "\n"}}
              :OutputSerialization {:JSON {:RecordDelimiter "\n"}}}})

As was discussed in Slack, this error occurs because AWS returns a custom format and can actually return different formats depending on the options you pass. Ideally, I'd love to be able to call this and get a stream that contains just the records I am looking for, but events such as progress messages could make that harder to deal with.

Stack traces


{:cognitect.anomalies/category :cognitect.anomalies/fault,
 :cognitect.aws.client/throwable #error {
 :cause "ParseError at [row,col]:[1,1]\nMessage: Content is not allowed in prolog."
 :via
 [{:type javax.xml.stream.XMLStreamException
   :message "ParseError at [row,col]:[1,1]\nMessage: Content is not allowed in prolog."
   :at [com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl next "XMLStreamReaderImpl.java" 652]}]
 :trace
 [[com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl next "XMLStreamReaderImpl.java" 652]
  [clojure.data.xml.jvm.parse$pull_seq$fn__29940 invoke "parse.clj" 78]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.RT seq "RT.java" 535]
  [clojure.core$seq__5402 invokeStatic "core.clj" 137]
  [clojure.core$seq__5402 invoke "core.clj" 137]
  [clojure.data.xml.tree$seq_tree$fn__29773 invoke "tree.clj" 39]
  [clojure.lang.LazySeq sval "LazySeq.java" 42]
  [clojure.lang.LazySeq seq "LazySeq.java" 51]
  [clojure.lang.LazySeq first "LazySeq.java" 73]
  [clojure.lang.RT first "RT.java" 692]
  [clojure.core$first__5384 invokeStatic "core.clj" 55]
  [clojure.core$ffirst__5394 invokeStatic "core.clj" 103]
  [clojure.core$ffirst__5394 invoke "core.clj" 103]
  [clojure.data.xml.tree$event_tree invokeStatic "tree.clj" 70]
  [clojure.data.xml.tree$event_tree invoke "tree.clj" 66]
  [clojure.data.xml$parse invokeStatic "xml.clj" 109]
  [clojure.data.xml$parse doInvoke "xml.clj" 84]
  [clojure.lang.RestFn invoke "RestFn.java" 486]
  [cognitect.aws.util$xml_read invokeStatic "util.clj" 148]
  [cognitect.aws.util$xml_read invoke "util.clj" 145]
  [cognitect.aws.shape$xml_parse invokeStatic "shape.clj" 217]
  [cognitect.aws.shape$xml_parse invoke "shape.clj" 214]
  [cognitect.aws.protocols.rest$parse_body invokeStatic "rest.clj" 243]
  [cognitect.aws.protocols.rest$parse_body invoke "rest.clj" 235]
  [cognitect.aws.protocols.rest$parse_http_response invokeStatic "rest.clj" 260]
  [cognitect.aws.protocols.rest$parse_http_response invoke "rest.clj" 249]
  [cognitect.aws.protocols.rest_xml$eval32111$fn__32112 invoke "rest_xml.clj" 23]
  [clojure.lang.MultiFn invoke "MultiFn.java" 239]
  [cognitect.aws.client$handle_http_response invokeStatic "client.clj" 49]
  [cognitect.aws.client$handle_http_response invoke "client.clj" 44]
  [cognitect.aws.client$send_request$fn__31513$state_machine__20194__auto____31540$fn__31543 invoke "client.clj" 112]
  [cognitect.aws.client$send_request$fn__31513$state_machine__20194__auto____31540 invoke "client.clj" 108]
  [clojure.core.async.impl.ioc_macros$run_state_machine invokeStatic "ioc_macros.clj" 973]
  [clojure.core.async.impl.ioc_macros$run_state_machine invoke "ioc_macros.clj" 972]
  [clojure.core.async.impl.ioc_macros$run_state_machine_wrapped invokeStatic "ioc_macros.clj" 977]
  [clojure.core.async.impl.ioc_macros$run_state_machine_wrapped invoke "ioc_macros.clj" 975]
  [clojure.core.async.impl.ioc_macros$take_BANG_$fn__20212 invoke "ioc_macros.clj" 986]
  [clojure.core.async.impl.channels.ManyToManyChannel$fn__15077$fn__15078 invoke "channels.clj" 95]
  [clojure.lang.AFn run "AFn.java" 22]
  [java.util.concurrent.ThreadPoolExecutor runWorker "ThreadPoolExecutor.java" 1128]
  [java.util.concurrent.ThreadPoolExecutor$Worker run "ThreadPoolExecutor.java" 628]
  [clojure.core.async.impl.concurrent$counted_thread_factory$reify__14946$fn__14947 invoke "concurrent.clj" 29]
  [clojure.lang.AFn run "AFn.java" 22]
  [java.lang.Thread run "Thread.java" 830]]}}

ghadishayban commented 4 years ago

Docs for this snowflake API: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTSelectObjectAppendix.html
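
For orientation, here is a minimal sketch of the message framing that appendix describes. It is not code from aws-api, and the function names (read-string-header, parse-messages, records-text) are made up for illustration. It assumes the whole response body is already available as a byte array, that every header value is a string (value type 7), and it skips the CRC checks entirely.

```clojure
(import '(java.nio ByteBuffer)
        '(java.nio.charset StandardCharsets)
        '(java.io ByteArrayOutputStream))

(defn read-string-header
  "Reads one header at the buffer's current position and returns [name value].
  Assumption: the value is a string (value type 7); anything else throws."
  [^ByteBuffer buf]
  (let [name-len           (bit-and (.get buf) 0xff)       ; 1-byte name length
        ^bytes name-bytes  (byte-array name-len)
        _                  (.get buf name-bytes)
        value-type         (.get buf)                      ; 1-byte value type
        _                  (when (not= 7 value-type)
                             (throw (ex-info "Unsupported header value type"
                                             {:value-type value-type})))
        value-len          (bit-and (.getShort buf) 0xffff) ; 2-byte value length
        ^bytes value-bytes (byte-array value-len)
        _                  (.get buf value-bytes)]
    [(String. name-bytes StandardCharsets/UTF_8)
     (String. value-bytes StandardCharsets/UTF_8)]))

(defn parse-messages
  "Splits concatenated event-stream messages into a vector of
  {:headers {name value}, :payload bytes}. CRC fields are read but not verified."
  [^bytes data]
  (let [buf (ByteBuffer/wrap data)]                ; ByteBuffer is big-endian by default
    (loop [messages []]
      (if-not (.hasRemaining buf)
        messages
        (let [total-len      (.getInt buf)         ; prelude: total message length
              headers-len    (.getInt buf)         ; prelude: length of the headers block
              _prelude-crc   (.getInt buf)         ; prelude CRC, ignored in this sketch
              headers-end    (+ (.position buf) headers-len)
              headers        (loop [hs {}]
                               (if (< (.position buf) headers-end)
                                 (let [[k v] (read-string-header buf)]
                                   (recur (assoc hs k v)))
                                 hs))
              ;; 16 = 12-byte prelude + 4-byte trailing message CRC
              ^bytes payload (byte-array (- total-len headers-len 16))
              _              (.get buf payload)
              _message-crc   (.getInt buf)]        ; message CRC, ignored in this sketch
          (recur (conj messages {:headers headers :payload payload})))))))

(defn records-text
  "Joins the payloads of all Records events and decodes them as UTF-8.
  Bytes are concatenated before decoding because a record may span messages."
  [messages]
  (let [out (ByteArrayOutputStream.)]
    (doseq [{:keys [headers payload]} messages
            :when (= "Records" (get headers ":event-type"))]
      (.write out ^bytes payload))
    (.toString out "UTF-8")))
```

With something like this, (records-text (parse-messages body-bytes)) would give back only the selected rows, while Progress, Stats, Cont and End events could be inspected through their :headers maps. A real implementation would want to verify the CRCs and work incrementally on an InputStream rather than on a fully buffered array.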

jamesdavidson commented 5 months ago

Yep, it's a weird one. I ended up just using the Java SDK (v1.12.132) via interop like so:

(let [client (-> (com.amazonaws.services.s3.AmazonS3ClientBuilder/standard)
                 (.withCredentials (new com.amazonaws.auth.profile.ProfileCredentialsProvider "my-profile"))
                 .build)
      bucket-name "bucket2"
      object-key "inventory/bucket1/DailyInventory/data/e56b826c-f557-445a-8389-645dcf95d2d2.csv.gz"
      query "SELECT s._1, s._2 FROM S3Object s limit 25"
      output-file-path "output.csv"
      input-serialization (-> (new com.amazonaws.services.s3.model.InputSerialization)
                              (.withCsv (new com.amazonaws.services.s3.model.CSVInput))
                              (.withCompressionType
                                (com.amazonaws.services.s3.model.CompressionType/GZIP)))
      output-serialization (-> (new com.amazonaws.services.s3.model.OutputSerialization)
                               (.withCsv (new com.amazonaws.services.s3.model.CSVOutput)))
      req (-> (new com.amazonaws.services.s3.model.SelectObjectContentRequest)
              (.withBucketName bucket-name)
              (.withKey object-key)
              (.withExpression query)
              (.withExpressionType com.amazonaws.services.s3.model.ExpressionType/SQL)
              (.withInputSerialization input-serialization)
              (.withOutputSerialization output-serialization))
      res (.selectObjectContent client req)]
  (with-open [out (clojure.java.io/output-stream output-file-path)]
    (-> res .getPayload .getRecordsInputStream (clojure.java.io/copy out))))

Full class names for easy copy and pasting.