csm / datahike-s3

Datahike on S3

Round-tripping of data seems to fail #1


csm commented 4 years ago

Something, somewhere seems to be turning a hitchhiker-tree node into a PersistentArrayMap, causing this after transacting a bunch of data:

#error{:cause "No implementation of method: :resolve of protocol: #'hitchhiker.tree.core/IResolve found for class: clojure.lang.PersistentArrayMap",
       :via [{:type java.lang.IllegalArgumentException,
              :message "No implementation of method: :resolve of protocol: #'hitchhiker.tree.core/IResolve found for class: clojure.lang.PersistentArrayMap",
              :at [clojure.core$_cache_protocol_fn invokeStatic "core_deftype.clj" 583]}],
       :trace [[clojure.core$_cache_protocol_fn invokeStatic "core_deftype.clj" 583]
               [clojure.core$_cache_protocol_fn invoke "core_deftype.clj" 575]
               [hitchhiker.tree.core$eval17439$fn__17462$G__17430__17467 invoke "core.cljc" 118]
               [hitchhiker.tree.messaging$enqueue invokeStatic "messaging.cljc" 89]
               [hitchhiker.tree.messaging$enqueue invoke "messaging.cljc" 77]
               [hitchhiker.tree.messaging$enqueue invokeStatic "messaging.cljc" 126]
               [hitchhiker.tree.messaging$enqueue invoke "messaging.cljc" 77]
               [hitchhiker.tree.messaging$enqueue invokeStatic "messaging.cljc" 81]
               [hitchhiker.tree.messaging$enqueue invoke "messaging.cljc" 77]
               [hitchhiker.tree.messaging$insert invokeStatic "messaging.cljc" 225]
               [hitchhiker.tree.messaging$insert invoke "messaging.cljc" 223]
               [datahike.index.hitchhiker_tree$_insert invokeStatic "hitchhiker_tree.cljc" 113]
               [datahike.index.hitchhiker_tree$_insert invoke "hitchhiker_tree.cljc" 112]
               [datahike.index$eval18956$fn__18965 invoke "index.cljc" 51]
               [datahike.index$eval18815$fn__18905$G__18798__18914 invoke "index.cljc" 8]
               [datahike.db$with_datom$fn__19748 invoke "db.cljc" 1093]
               [clojure.lang.AFn applyToHelper "AFn.java" 154]
               [clojure.lang.AFn applyTo "AFn.java" 144]
               [clojure.core$apply invokeStatic "core.clj" 667]
               [clojure.core$update_in$up__6853 invoke "core.clj" 6185]
               [clojure.core$update_in invokeStatic "core.clj" 6186]
               [clojure.core$update_in doInvoke "core.clj" 6172]
               [clojure.lang.RestFn invoke "RestFn.java" 445]
               [datahike.db$with_datom invokeStatic "db.cljc" 1093]
               [datahike.db$with_datom invoke "db.cljc" 1071]
               [clojure.lang.AFn applyToHelper "AFn.java" 156]
               [clojure.lang.AFn applyTo "AFn.java" 144]
               [clojure.core$apply invokeStatic "core.clj" 667]
               [clojure.core$update_in$up__6853 invoke "core.clj" 6185]
               [clojure.core$update_in invokeStatic "core.clj" 6186]
               [clojure.core$update_in doInvoke "core.clj" 6172]
               [clojure.lang.RestFn invoke "RestFn.java" 467]
               [datahike.db$transact_report invokeStatic "db.cljc" 1116]
               [datahike.db$transact_report invoke "db.cljc" 1114]
               [datahike.db$transact_add invokeStatic "db.cljc" 1214]
               [datahike.db$transact_add invoke "db.cljc" 1198]
               [datahike.db$transact_tx_data invokeStatic "db.cljc" 1434]
               [datahike.db$transact_tx_data invoke "db.cljc" 1280]
               [datahike.core$with invokeStatic "core.cljc" 231]
               [datahike.core$with invoke "core.cljc" 224]
               [datahike.core$_transact_BANG_$fn__22650 invoke "core.cljc" 438]
               [clojure.lang.Atom swap "Atom.java" 37]
               [clojure.core$swap_BANG_ invokeStatic "core.clj" 2352]
               [clojure.core$swap_BANG_ invoke "core.clj" 2345]
               [datahike.core$_transact_BANG_ invokeStatic "core.cljc" 437]
               [datahike.core$_transact_BANG_ invoke "core.cljc" 434]
               [datahike.core$transact_BANG_ invokeStatic "core.cljc" 529]
               [datahike.core$transact_BANG_ invoke "core.cljc" 444]
               [datahike.core$transact invokeStatic "core.cljc" 636]
               [datahike.core$transact invoke "core.cljc" 629]
               [datahike.core$transact invokeStatic "core.cljc" 633]
               [datahike.core$transact invoke "core.cljc" 629]
               [datahike.connector$eval37403$fn__37405$fn__37407 invoke "connector.cljc" 31]
               [clojure.core$binding_conveyor_fn$fn__5754 invoke "core.clj" 2030]
               [clojure.lang.AFn call "AFn.java" 18]
               [java.util.concurrent.FutureTask run "FutureTask.java" 266]
               [java.util.concurrent.ThreadPoolExecutor runWorker "ThreadPoolExecutor.java" 1149]
               [java.util.concurrent.ThreadPoolExecutor$Worker run "ThreadPoolExecutor.java" 624]
               [java.lang.Thread run "Thread.java" 748]]}

I think the fressian serializer is set up correctly, so the problem is likely in the -update-in logic.
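
For illustration, here is a minimal sketch of one plausible failure mode: if a node is serialized with a custom fressian write handler but read back without the matching read handler (or written as a plain map by a generic record handler), it comes back as ordinary map data, and the IResolve protocol dispatch fails exactly as in the trace above. The record, tag name, and handlers below are hypothetical, not the actual datahike ones:

```clojure
(require '[clojure.data.fressian :as fress])
(import '[org.fressian.handlers WriteHandler ReadHandler])

;; Hypothetical stand-in for a hitchhiker-tree node record.
(defrecord DataNode [children])

(def write-handlers
  (-> (merge {DataNode
              {"hh/datanode"
               (reify WriteHandler
                 (write [_ w node]
                   (.writeTag w "hh/datanode" 1)
                   (.writeObject w (:children node))))}}
             fress/clojure-write-handlers)
      fress/associative-lookup
      fress/inheritance-lookup))

(def read-handlers
  (fress/associative-lookup
   (merge {"hh/datanode"
           (reify ReadHandler
             (read [_ rdr tag component-count]
               (->DataNode (.readObject rdr))))}
          fress/clojure-read-handlers)))

;; With the matching read handler, the record type survives the round trip:
(type (fress/read (fress/write (->DataNode [1 2 3]) :handlers write-handlers)
                  :handlers read-handlers))
;; => user.DataNode

;; If the node were instead written or read back as a plain map, the result
;; here would be a PersistentArrayMap, and the IResolve dispatch would fail.
```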

whilo commented 4 years ago

Very nice work! I have just tried to run the konserve tests, but I seem to need an S3 account (?).

Have you made any progress on this issue? We would really like to support S3 with datahike.

csm commented 4 years ago

For testing the best option is probably s4, which I wrote for this exact use case (for extra fun s4 also uses konserve as its own backend, so you could possibly build a tower of systems each going through s3-via-konserve 😏 )

I was also recently experimenting with writing hitchhiker trees directly to S3 along with doing my own tx-log in dynamodb.

Also, I split out the konserve part into its own repo.

whilo commented 4 years ago

Yes, I have just seen this. Pretty cool. If konserve-ddb-s3 works, then we would effectively have a general-purpose hitchhiker-tree with S3 support, but I assume you have realized this. The roots of the tree are written by datahike explicitly to the same backend, but you could totally use DynamoDB for this. I still have to comprehend all the S3 bits you are juggling in the konserve backend.

Btw. do you still have a handle on the Clojurians Slack?

whilo commented 4 years ago

s4 is super cool! :heart_eyes: Can you reproduce this issue with s4? If so, that would suggest it is a konserve issue independent of the S3 internals.

whilo commented 4 years ago

Your konserve tests look like they cover nesting and unnesting with update-in and get-in. It would be interesting to see what the persistent map that is mistakenly popping up looks like. I think a serializer issue is most likely. Btw. :+1: for the lz4 compression; I think that would be cool to have in konserve in general.
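
For reference, here is what such a round-trip check looks like against konserve's filestore backend (a minimal sketch; the store path and keys are just placeholders, and the filestore stands in for S3):

```clojure
(require '[konserve.filestore :refer [new-fs-store]]
         '[konserve.core :as k]
         '[clojure.core.async :refer [<!!]])

;; All konserve operations return core.async channels.
(def store (<!! (new-fs-store "/tmp/konserve-roundtrip")))

(<!! (k/assoc-in store [:tree] {:root {:children [1 2 3]}}))
(<!! (k/update-in store [:tree :root :children] #(conj % 4)))
(<!! (k/get-in store [:tree :root]))
;; => {:children [1 2 3 4]}
;; If a value came back here as a different type than what was written
;; (e.g. a record degraded to a plain map), the serializer is the suspect.
```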

csm commented 4 years ago

It's possible I fixed this issue in one of the various iterations of the code; unfortunately I don't remember whether I still saw it in the last tests I ran.

So it may simply have fallen away while I was iterating on the implementation.

whilo commented 4 years ago

So this issue is not popping up for you when you run your tests? If I understand correctly, to test it I have to set the proper credentials, e.g. through environment variables (https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html)?

csm commented 4 years ago

Look at this example script to start DynamoDB and S4 locally, storing data to the filesystem: https://github.com/csm/datahike-s3/blob/master/examples/mbrainz.clj. You don't need AWS credentials to use the local dynamodb/s3 servers.

That AWS docs link should work for specifying real AWS credentials in most cases; this project uses the Cognitect AWS client. I know that both the default credentials profile file and instance profile credentials work.
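
For completeness, a sketch of wiring up credentials with the Cognitect aws-api client, both against real AWS and against a local endpoint (region, port, and keys below are placeholder values):

```clojure
(require '[cognitect.aws.client.api :as aws]
         '[cognitect.aws.credentials :as credentials])

;; Against real AWS: the default provider chain picks up the profile file,
;; environment variables, or instance profile credentials automatically.
(def s3 (aws/client {:api :s3 :region "us-east-1"}))

;; Against a local s4/DynamoDB-local setup: point the client at localhost
;; and supply dummy keys.
(def local-s3
  (aws/client {:api :s3
               :region "us-east-1"
               :credentials-provider (credentials/basic-credentials-provider
                                      {:access-key-id     "ACCESS"
                                       :secret-access-key "SECRET"})
               :endpoint-override {:protocol :http
                                   :hostname "localhost"
                                   :port     4569}}))

(aws/invoke local-s3 {:op :ListBuckets})
```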

whilo commented 4 years ago

Ok, thanks for providing the background. I seem to get an error about using reflection to access Unsafe, plus a region error, but I guess they are Java-version related (probably I need an older JDK or something; I will check tomorrow): https://pastebin.com/vVC25pmE

csm commented 4 years ago

I started doing some further work in these repos:

The idea is that as ops are added to the index node, they are put in DynamoDB, avoiding churn of objects written to S3.
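
Purely as an illustration of that write path (all names are made up, with atoms standing in for DynamoDB and S3):

```clojure
;; Illustrative sketch: buffer per-node ops in a fast key-value store
;; (DynamoDB) and only rewrite the node in S3 once the buffer exceeds a
;; threshold, instead of rewriting the S3 object on every transaction.
(def op-buffer (atom {}))   ; stands in for DynamoDB
(def node-store (atom {}))  ; stands in for S3

(def flush-threshold 8)

(defn flush-node! [node-id]
  (let [ops (get @op-buffer node-id)]
    ;; One S3 write replaces many small per-op writes.
    (swap! node-store update node-id (fnil into []) ops)
    (swap! op-buffer dissoc node-id)))

(defn add-op! [node-id op]
  (let [ops (get (swap! op-buffer update node-id (fnil conj []) op) node-id)]
    (when (>= (count ops) flush-threshold)
      (flush-node! node-id))))
```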