Open nelsonic opened 5 years ago
Have been looking into CIDs and IPFS. See here for all thoughts captured.
CIDs are made up of codecs and multihashes.
Multihashes themselves are self describing hashes (e.g. they contain information about the hashing algorithm that was used to hash the data). See this comment for an example of a multihash in elixir.
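A minimal sketch of what "self-describing" means for sha2-256, using only Erlang's :crypto. The module name MultihashSketch is made up for illustration and is not part of ex_multihash; the prefix bytes 0x12 (hash function code) and 0x20 (digest length, 32) are the same two bytes that appear at the front of the Multihash.encode/2 output further down this thread.

```elixir
defmodule MultihashSketch do
  # Hypothetical helper for illustration only.
  # 0x12 is the multihash code for sha2-256. Codes and lengths are varints,
  # but values below 128 fit in a single byte, which is all we need here.
  @sha2_256_code 0x12

  # Prefix the raw digest with <code><length> so the hash describes itself.
  def sha2_256(data) do
    digest = :crypto.hash(:sha256, data)
    <<@sha2_256_code, byte_size(digest), digest::binary>>
  end

  # Read the prefix back out of a multihash.
  def describe(<<code, len, digest::binary-size(len)>>) do
    %{code: code, length: len, digest: digest}
  end
end

mh = MultihashSketch.sha2_256("Hello World\n")
IO.inspect(MultihashSketch.describe(mh))
```

Anyone receiving these 34 bytes can tell from the first two which algorithm produced the remaining 32.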
In order to create our own CIDs it appears we need to be able to create a multihash (something that we have been able to do with ex_multihash) and a codec.
Looking into codecs now to get a better understanding of what exactly they are. They appear to have a similar role in a CID as the hash_type in a multihash.
I have been able to recreate the steps listed in this article. This shows that I can at least create the same hashes as CIDv0 myself using the command line.
I do not fully understand what additional data that IPFS is adding to the data that you store on it...
$ echo "Hello World" | ipfs add -n
added QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u
$ ipfs block get QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u | sed -n l
$
\022\b\002\022\fHello World$
\030\f$
As you can see above, I have just added Hello World text to a file in IPFS but when I log the file from IPFS it shows more than just Hello World now.
However, I do not think that we need to fully understand exactly what extra data IPFS is adding right now. We should (hopefully) be able to use a library that will handle this step for us (otherwise we are just reimplementing the IPFS logic ourselves which doesn't seem logical)
The code above only relates to CIDv0.
CIDv1
<mb><version><mc><mh>
CIDv0
<mh>
mb = multibase prefix
version = CID version
mc = multicodec-packed-content-type
mh = multihash-content-address
CIDv0 is only the last part of CIDv1.
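To make the layout concrete, here is a rough sketch of the binary part of each CID version, before any multibase prefix is applied. CidLayoutSketch is a hypothetical module name; the codec codes 0x55 (raw) and 0x70 (dag-pb) are taken from the multicodec table, and both they and the version number fit in a single byte, so no varint handling is shown.

```elixir
defmodule CidLayoutSketch do
  # Multicodec codes for the two codecs discussed in this thread.
  @codecs %{"raw" => 0x55, "dag-pb" => 0x70}

  # CIDv1 bytes: <version><mc><mh>. The <mb> multibase prefix is only added
  # when the bytes are turned into a string.
  def v1_bytes(codec, multihash) do
    <<1, Map.fetch!(@codecs, codec)>> <> multihash
  end

  # CIDv0 bytes are nothing more than the multihash.
  def v0_bytes(multihash), do: multihash
end

digest = :crypto.hash(:sha256, "Hello World\n")
multihash = <<0x12, 0x20>> <> digest
v1 = CidLayoutSketch.v1_bytes("raw", multihash)

# Hex is used here only to make the three parts easy to see.
IO.puts(Base.encode16(v1, case: :lower))
```

In the hex output the first byte is the version, the second the codec, and everything after that is byte-for-byte the CIDv0 payload.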
Try to recreate the steps to create a CIDv0 hash taken in the command line, but in Elixir.
At the moment I want to create the same hash as the one from the command line.
e.g. the result of...
:crypto.hash(:sha256, "Hello World\n") # echo appends a trailing newline, so it is included here
should be the same as
echo "Hello World" | shasum -a 256
d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26
This is not the same hash that IPFS creates but if we can match a more simple sha256 hash that would be a good first step.
file = File.read!("hello.txt")
:crypto.hash(:sha256, file)
|> Base.encode16(case: :lower)
|> IO.inspect()
"d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26"
This is the same as the sha256 output from the command line. Now I know that I can reliably get the same hash string using Elixir as I can in the terminal.
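One detail worth spelling out: the match depends on the trailing newline. `echo` writes "Hello World\n", and the hello.txt file read later in this thread also ends in "\n". A small sketch of the difference (the anonymous function `hex` is just a local helper):

```elixir
# Local helper: sha256 as a lowercase hex string.
hex = fn data -> :crypto.hash(:sha256, data) |> Base.encode16(case: :lower) end

with_newline = hex.("Hello World\n")   # the bytes `echo "Hello World"` produces
without_newline = hex.("Hello World")  # a bare string literal

IO.puts(with_newline)
IO.puts(without_newline)
IO.puts(with_newline == without_newline)
```

Only the first digest corresponds to the shasum output shown above; a single extra byte changes the whole hash.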
@RobStallion CID v0 is irrelevant to us at this stage. We will not use it.
The only reason it still exists is for backward compatibility reasons for existing CIDs.
Since we are only creating new CIDs in our apps and not decoding any CIDs that have been put on IPFS, we do not need v0 compatibility for the foreseeable future.
I thought I would focus on CIDv0 as it is contained in CIDv1...
CIDv1
<mb><version><mc><mh>
CIDv0
<mh>
Also, I have installed IPFS locally and the hash string that I am getting back at the moment is still CIDv0.
I do want to get v1 working as it is the 'future', but I felt that work on v0 would be needed no matter what.
Have I misunderstood this @nelsonic?
@RobStallion provided the <mh> (multihash) that is used is the one we need then, yes.
We only need sha2-256 (which I have added above) ... thanks for reminding me.
iex(1)> codec = "dag-pb" # the codec needed to hash CIDv0
"dag-pb"
iex(2)> file = File.read!("hello.txt") # reading the file with the text of Hello World (same file that was uploaded to IPFS)
"Hello World\n"
iex(3)> digest = :crypto.hash(:sha256, file) # hash the file string with sha256
<<210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199, 74, 221,
90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74, 38>>
iex(4)> {:ok, multihash} = Multihash.encode(:sha2_256, digest) # create multihash from hash (see #8 for more info on this step)
{:ok,
<<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199,
74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74,
38>>}
iex(5)> cid = CID.cid!(multihash, codec, 0) # creates a CID struct (just simple struct creation. Something everyone has done 1000 times)
%CID{
codec: "dag-pb",
multihash: <<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205,
139, 226, 199, 74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41,
218, 128, 74, 38>>,
version: 0
}
iex(6)> CID.encode!(cid) # turns the CID struct into a base58 string (this is where the magic is happening)
"QmcWyBPyedDzHFytTX6CAjjpvqQAyhzURziwiBKDKgqx6R"
Using this online tool I have converted the base58 string to base16.
1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26
As you can see, this matches my digest from IPFS. This means that the above functions are working as expected.
Most of the above is pretty straightforward to understand. Need to look into the CID.encode function to get a better understanding of what is happening here and how it works.
#progress
(keep up the good work!) If you are able to push some of the code on your branch it would be amaze! Thanks!
1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26
As you can see, this matches my digest from IPFS. This means that the above functions are working as expected.
This line was a mistake. It is not the same as the one from IPFS. It is the same as the hash of the file that was created in the terminal here and in iex here.
What this means (to me at least) is that all the CID.encode function does (for CIDv0) is take the multihash and turn it into a base58 string.
That is literally it!!!
This can be done with the following lines of code...
defmodule CidTester do
def read_file(str), do: File.read!(str)
def hash(file), do: :crypto.hash(:sha256, file)
def multihash(digest), do: Multihash.encode(:sha2_256, digest)
def encode({:ok, multihash}), do: Base.encode16(multihash, case: :lower)
def run(filename) do
filename
|> read_file()
|> hash()
|> multihash()
|> encode()
end
end
Then run iex -S mix and call the CidTester.run/1 function with the filename...
iex(1)> CidTester.run("hello.txt")
"1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26"
As you can see this is the same as calling the CID.encode!/1 function with a CID struct...
(following block is a snippet from here)
iex(6)> CID.encode!(cid)
"QmcWyBPyedDzHFytTX6CAjjpvqQAyhzURziwiBKDKgqx6R"
Same as "1220d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26" when converted to base16.
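The CidTester module above stops at base16, so here is a sketch that goes all the way to the base58 string without any dependency. CidV0Sketch and its functions are hypothetical names written for this comment, not the ex_cid or b58 API; the alphabet is the Bitcoin base58 alphabet.

```elixir
defmodule CidV0Sketch do
  # Bitcoin base58 alphabet (no 0, O, I or l).
  @alphabet ~c"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

  # Each leading zero byte becomes a literal "1"; the rest is treated as one
  # big unsigned integer and written out in base 58.
  def base58btc(<<0, rest::binary>>), do: "1" <> base58btc(rest)
  def base58btc(<<>>), do: ""

  def base58btc(bin) do
    bin |> :binary.decode_unsigned() |> digits([]) |> List.to_string()
  end

  defp digits(0, acc), do: acc

  defp digits(n, acc) do
    digits(div(n, 58), [Enum.at(@alphabet, rem(n, 58)) | acc])
  end

  # CIDv0 string = base58btc(<<0x12, 0x20>> <> sha256(data)), nothing else.
  def cid_v0(data) do
    multihash = <<0x12, 0x20>> <> :crypto.hash(:sha256, data)
    base58btc(multihash)
  end
end

IO.puts(CidV0Sketch.cid_v0("Hello World\n"))
```

Run against the contents of hello.txt this should give the same Qmc... string that CID.encode!/1 returned, which supports the idea that no other transformation is involved for v0.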
The ex_cid module is not returning the same CID values as IPFS. It is only returning the multihash as a base58 string (for version 0 CIDs).
This does not mean that the ex_cid module is not working however. I have spoken to @SimonLab and he has shown that js-cid produces the same cid string.
This really confused me as both modules are producing the same string (which is just a base58 string of a multihash) and that string is not the same as the one from IPFS. This seems to be because these modules are not adding the extra data that IPFS adds to a file when it is added to IPFS. For example...
this is the hello text file on my local machine...
$ sed -n l hello.txt
Hello World$
Next, I'll add this file to IPFS...
$ ipfs add -n hello.txt
added QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u hello.txt
Now if we run the same sed command on the IPFS file we see that there is more info than the one on my machine...
$ ipfs block get QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u | sed -n l
$
\022\b\002\022\fHello World$
\030\f$
I think that the difference in the CIDs is coming from this extra data that IPFS is adding to the file.
This means that as it currently stands, the CIDs that these modules are creating can not be used to get data from IPFS as they will not be the correct CID for the data that is on IPFS.
to put it simply "QmcWyBPyedDzHFytTX6CAjjpvqQAyhzURziwiBKDKgqx6R" from the CID module is not the same as "QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u" from IPFS despite the same file being passed in to both.
The CIDs that these modules make can be used in our projects and will always produce the same CID for the same data that is passed in. We just cannot integrate them into IPFS right now as they will not be able to retrieve that same data from IPFS (if my understanding is correct).
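A small sketch of why the two CIDs diverge. The `block` bytes are my own reconstruction of the `ipfs block get ... | sed -n l` output above, with the octal escapes converted by hand (the empty first line is a 0x0A byte, \022 is 0x12, \b is 0x08, \002 is 0x02, \f is 0x0C, \030 is 0x18). Treat that reconstruction as an assumption, not something verified against IPFS.

```elixir
file = "Hello World\n"

# Reconstructed block: 6 prefix bytes, the 12 file bytes, 2 suffix bytes.
# The second byte (0x12 = 18) matches the number of bytes that follow it,
# and the last byte (0x0C = 12) matches the file size.
block = <<0x0A, 0x12, 0x08, 0x02, 0x12, 0x0C>> <> file <> <<0x18, 0x0C>>

hex = fn data -> :crypto.hash(:sha256, data) |> Base.encode16(case: :lower) end

IO.puts("raw file : " <> hex.(file))
IO.puts("wrapped  : " <> hex.(block))
IO.puts(hex.(file) == hex.(block))
```

The digests differ because the hashed bytes differ. Presumably IPFS builds its v0 CID from the multihash of the wrapped block rather than of the file, which would explain the QmWATW... string, but that step is not checked here.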
After speaking with @SimonLab about this problem, he came across this, https://github.com/ipfs/ipfs#protocol-implementations.
This seems to be the missing step. I haven't had much of a chance to look into this as of now but on my brief look it says to raise an issue if you want to implement this in a specific language. I looked at the issues and the only issue I saw with a mention of Elixir is issue83. This issue has a link to the following repo, https://github.com/tensor-programming/Elixir-Ipfs-Api.
I will begin looking into this 'missing step' in more detail.
@nelsonic @SimonLab do either of you have any thoughts on this (sorry for the SUPER long comment. Hopefully it makes sense)
@nelsonic I believe that in order for us to be able to complete this issue (Implement an IPFS compatible CID function in Elixir) we will need to include this step
@RobStallion this comment makes sense. (thanks for adding this detail) Please formulate this question on StackOverflow so that (a) we confirm our own understanding and (b) we can seek help from the IPFS/JS community. Thanks.
https://github.com/ipfs/go-cid/issues/77. Someone has had this issue in Go.
I have confirmed that I can get a matching CID using ex_cid when the cid is v1 and the codec is "raw".
$ ipfs add --cid-version=1 hello.txt
added zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9 hello.txt
Now in iex
iex(1)> file = File.read!("hello.txt")
"Hello World\n"
iex(2)> digest = :crypto.hash(:sha256, file)
<<210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199, 74, 221,
90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74, 38>>
iex(3)> {:ok, multihash} = Multihash.encode(:sha2_256, digest)
{:ok,
<<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205, 139, 226, 199,
74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41, 218, 128, 74,
38>>}
iex(4)> cid = CID.cid!(multihash, "raw", 1)
%CID{
codec: "raw",
multihash: <<18, 32, 210, 168, 79, 75, 139, 101, 9, 55, 236, 143, 115, 205,
139, 226, 199, 74, 221, 90, 145, 27, 166, 77, 242, 116, 88, 237, 130, 41,
218, 128, 74, 38>>,
version: 1
}
iex(5)> CID.encode cid
{:ok, "zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9"}
As you can see, the two CIDs created match (for sure this time)
zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9
zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9
This IS a step in the right direction but is not a solution. This will not work for all files. It will only work for files that are smaller than a certain size (256kb).
Let's repeat the steps above with a larger file...
$ ipfs add --cid-version=1 elm-slides.pdf
added zdj7We6WnfhRq5zmJZDeMKdKmS2z8fEPrUSneapijtnQYzYpm elm-slides.pdf
1.11 MiB / 1.11 MiB [===========================================================] 100.00%
As you can see this file is 1.11MiB. When we repeat the steps with ex_cid with this file...
iex(1)> file = File.read!("elm-slides.pdf")
<<37, 80, 68, 70, 45, 49, 46, 55, 13, 10, 37, 161, 179, 197, 215, 13, 10, 49,
32, 48, 32, 111, 98, 106, 13, 10, 60, 60, 47, 80, 97, 103, 101, 115, 32, 50,
32, 48, 32, 82, 32, 47, 84, 121, 112, 101, 47, 67, 97, 116, ...>>
iex(2)> digest = :crypto.hash(:sha256, file)
<<80, 53, 122, 165, 21, 149, 132, 189, 86, 141, 57, 245, 185, 240, 119, 254,
217, 210, 49, 37, 225, 87, 43, 153, 79, 135, 166, 115, 82, 144, 54, 51>>
iex(3)> {:ok, multihash} = Multihash.encode(:sha2_256, digest)
{:ok,
<<18, 32, 80, 53, 122, 165, 21, 149, 132, 189, 86, 141, 57, 245, 185, 240, 119,
254, 217, 210, 49, 37, 225, 87, 43, 153, 79, 135, 166, 115, 82, 144, 54,
51>>}
iex(4)> cid = CID.cid!(multihash, "raw", 1)
%CID{
codec: "raw",
multihash: <<18, 32, 80, 53, 122, 165, 21, 149, 132, 189, 86, 141, 57, 245,
185, 240, 119, 254, 217, 210, 49, 37, 225, 87, 43, 153, 79, 135, 166, 115,
82, 144, 54, 51>>,
version: 1
}
iex(5)> CID.encode cid
{:ok, "zb2rhc3P77eryPttouAgYrzwuByVmkDSrLRt1UciwUmWmUzCS"}
You can see that the 2 CIDs do not match...
zdj7We6WnfhRq5zmJZDeMKdKmS2z8fEPrUSneapijtnQYzYpm
zb2rhc3P77eryPttouAgYrzwuByVmkDSrLRt1UciwUmWmUzCS
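The mismatch is consistent with IPFS splitting large files into chunks before hashing. Below is a rough sketch of the splitting step only, assuming a chunk size of 262,144 bytes (my reading of the 256kb figure mentioned above). ChunkSketch is a hypothetical module, and how IPFS links the chunks into a root node is not reproduced, so this will not yield the zdj7W... CID.

```elixir
defmodule ChunkSketch do
  # Assumed chunk size: 256 KiB.
  @chunk_size 262_144

  # Split a binary into fixed-size chunks; the last one may be shorter.
  def chunks(bin, size \\ @chunk_size)
  def chunks(<<>>, _size), do: []
  def chunks(bin, size) when byte_size(bin) <= size, do: [bin]

  def chunks(bin, size) do
    <<chunk::binary-size(size), rest::binary>> = bin
    [chunk | chunks(rest, size)]
  end
end

# Stand-in for a file of roughly 1.11 MiB (zero bytes, not the real PDF).
fake_file = :binary.copy(<<0>>, 1_163_919)
parts = ChunkSketch.chunks(fake_file)

IO.puts(length(parts))

# Each chunk gets its own digest, so there is no single "hash of the file".
Enum.each(parts, fn part ->
  IO.puts(Base.encode16(:crypto.hash(:sha256, part), case: :lower))
end)
```

A file of this size ends up as several blocks, each with its own hash, so hashing the whole file in one go (as the iex session above does) cannot be expected to match.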
@RobStallion it's good that you are being thorough with your investigation, but please note that we will not be hashing files (yet) only hashing Ecto Changesets i.e. Elixir Maps in order to generate the CID for a record before inserting it into the database.
We can return to the "large file" quest later or even write a Node.js/Go microservice on AWS lambda to do our file uploads e.g: uploading images. For now we literally only need the most basic CID such that a map of %User{ name: "Rob", username: "robdabank"} will create a valid CID so we can insert the data.
It seems that when we upload a small file to IPFS in version1 with the "raw" codec it doesn't manipulate the data. This can be seen with the following...
$ ipfs add --cid-version=1 hello.txt # add hello.txt to ipfs
added zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9 hello.txt
$ sed -n l hello.txt # print contents of hello.txt
Hello World$
$ ipfs block get zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9 | sed -n l # print contents of file from ipfs
Hello World$
As you can see, when we retrieve the file from IPFS and log the data it hasn't added anything new to it like it did when we did this with v0 (see this comment for example).
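Since the block is stored unmodified in this case, the whole v1 "raw" CID can be sketched by hand. CidV1Sketch is a hypothetical module, not the ex_cid API. It assumes the string form is the multibase prefix "z" (base58btc) followed by the base58 encoding of <<1, 0x55>> <> multihash, which is how I read the <mb><version><mc><mh> layout; 0x55 is the multicodec code for raw.

```elixir
defmodule CidV1Sketch do
  @alphabet ~c"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

  # Minimal base58btc encoder: leading zero bytes map to "1", the remainder
  # is converted as one big unsigned integer.
  def base58btc(<<0, rest::binary>>), do: "1" <> base58btc(rest)
  def base58btc(<<>>), do: ""

  def base58btc(bin) do
    bin |> :binary.decode_unsigned() |> digits([]) |> List.to_string()
  end

  defp digits(0, acc), do: acc

  defp digits(n, acc) do
    digits(div(n, 58), [Enum.at(@alphabet, rem(n, 58)) | acc])
  end

  # <mb> = "z", <version> = 1, <mc> = 0x55 (raw), <mh> = sha2-256 multihash.
  def cid_v1_raw(data) do
    multihash = <<0x12, 0x20>> <> :crypto.hash(:sha256, data)
    "z" <> base58btc(<<1, 0x55>> <> multihash)
  end
end

IO.puts(CidV1Sketch.cid_v1_raw("Hello World\n"))
```

If the assumptions hold, this prints the same zb2rh... string that `ipfs add --cid-version=1 hello.txt` reported above.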
@nelsonic based on this comment maybe we could make a start with...
%CID{
codec: "raw",
multihash: << the hash of a user struct for example (see below)* >>,
version: 1
}
And see if we can get the same CID given the multihash of a struct.
*Will need to look into how to hash a struct. I believe that there is an erlang function for turning maps into strings but not sure about this at the moment
If we can convert structs into strings reliably (for example, the order of the keys will not affect the generated string) then (hopefully) we should be able to create the same CID with IPFS and ex_cid.
@RobStallion agreed. please focus on that. We might need to convert the Struct to JSON and then hash the stringified JSON in order to make it JS-compatible...?
Could we use the term_to_binary and binary_to_term functions from Erlang to marshal/unmarshal any Elixir data?
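A sketch of the term_to_binary idea, with one precaution added: the key/value pairs are sorted before serialising so that the bytes do not depend on how the map was built. MapHashSketch is a hypothetical helper, and it only canonicalises the top level of the map (nested maps are serialised as-is).

```elixir
defmodule MapHashSketch do
  # Hypothetical helper: serialise a map's sorted key/value pairs and hash them.
  def digest(map) when is_map(map) do
    bytes =
      map
      |> Map.to_list()
      |> Enum.sort()
      |> :erlang.term_to_binary()

    :crypto.hash(:sha256, bytes) |> Base.encode16(case: :lower)
  end
end

# The same data, inserted in a different order.
a = %{} |> Map.put(:name, "Rob") |> Map.put(:username, "robdabank")
b = %{} |> Map.put(:username, "robdabank") |> Map.put(:name, "Rob")

IO.puts(MapHashSketch.digest(a) == MapHashSketch.digest(b))
```

The trade-off is that the serialised bytes are in Erlang's external term format, so a JavaScript client could not easily reproduce the same hash. That is the argument for the JSON route suggested above.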
const CID = require('../src')
const multihashing = require('multihashing-async')
const buffer = Buffer.from('Hello World\n')
multihashing(buffer, 'sha2-256', (err, mh) => {
const cid = new CID(1, 'raw', mh)
console.log(cid.toBaseEncodedString())
})
zb2rhkpbfTBtUV1ESqSScrUre8Hh77fhCKDLmX21rCo5xp8J9
Confirms that the js-cid package will return the same CID given the same value.
Give both packages a JSON object and confirm that they create the same CID.
JS implementation
const aObj = { a: "a" }
const json_a = JSON.stringify(aObj)
console.log(json_a);
const buffer2 = Buffer.from(json_a)
multihashing(buffer2, 'sha2-256', (err, mh) => {
const cid = new CID(1, 'raw', mh)
console.log(cid.toBaseEncodedString())
})
{"a":"a"}
zb2rhdeaHh2UHghBcwxeFP1GRUYETDH96DkV6oppiz5Gk1xGN
elixir implementation
iex(1)> map = %{a: "a"}
%{a: "a"}
iex(2)> json = Jason.encode!(map)
"{\"a\":\"a\"}"
iex(3)> digest = :crypto.hash(:sha256, json)
iex(4)> {:ok, multihash} = Multihash.encode(:sha2_256, digest)
iex(5)> cid = CID.cid!(multihash, "raw", 1)
iex(6)> CID.encode(cid)
{:ok, "zb2rhdeaHh2UHghBcwxeFP1GRUYETDH96DkV6oppiz5Gk1xGN"}
Both CIDs appear to be the same
Going to try and recreate this with IPFS now
IPFS implementation
$ echo "{\"a\":\"a\"}" | ipfs add --cid-version 1
added zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ
This CID is different. Not sure why at the moment. Looking into this in more detail.
$ echo "{\"a\":\"a\"}"
{"a":"a"}
Seems to return a JSON looking object.
Getting the file from IPFS also looks like it returns that same object...
$ ipfs block get zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ | sed -n l
{"a":"a"}$
In the earlier examples with the string of "Hello World", we had to add a new line to the end of the string in order to get the same CID. Will try this with the JSON string.
new elixir implementation...
Created a file called json.txt which only contains
{"a":"a"}
iex(14)> file = File.read!("json.txt")
"{\"a\":\"a\"}\n"
iex(15)> digest2 = :crypto.hash(:sha256, file)
<<72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167, 139, 145, 12, 84, 241,
135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208, 82, 81, 176, 200>>
iex(16)> {:ok, multihash2} = Multihash.encode(:sha2_256, digest2)
{:ok,
<<18, 32, 72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167, 139, 145, 12,
84, 241, 135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208, 82, 81, 176,
200>>}
iex(17)> cid2 = CID.cid!(multihash2, "raw", 1)
%CID{
codec: "raw",
multihash: <<18, 32, 72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167,
139, 145, 12, 84, 241, 135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208,
82, 81, 176, 200>>,
version: 1
}
iex(18)> CID.encode(cid2)
{:ok, "zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ"}
This is now the same as IPFS
zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ
{:ok, "zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ"}
The only difference from the first Elixir implementation is that the json variable did not have a new line, "\n", on the end.
first elixir attempt
iex(2)> json = Jason.encode!(map)
"{\"a\":\"a\"}"
second attempt
iex(14)> file = File.read!("json.txt")
"{\"a\":\"a\"}\n"
We should easily be able to fix this by just appending a new line to the end of a JSON object in elixir.
elixir implementation adding new line to end of json...
iex(1)> map = %{a: "a"}
%{a: "a"}
iex(2)> json = Jason.encode!(map)
"{\"a\":\"a\"}"
iex(3)> json = json <> "\n"
"{\"a\":\"a\"}\n"
iex(4)> digest = :crypto.hash(:sha256, json)
<<72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167, 139, 145, 12, 84, 241,
135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208, 82, 81, 176, 200>>
iex(5)> {:ok, multihash} = Multihash.encode(:sha2_256, digest)
{:ok,
<<18, 32, 72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167, 139, 145, 12,
84, 241, 135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208, 82, 81, 176,
200>>}
iex(6)> cid = CID.cid!(multihash, "raw", 1)
%CID{
codec: "raw",
multihash: <<18, 32, 72, 240, 49, 34, 62, 109, 19, 11, 162, 226, 162, 167,
139, 145, 12, 84, 241, 135, 103, 97, 197, 136, 212, 17, 101, 6, 242, 208,
82, 81, 176, 200>>,
version: 1
}
iex(7)> CID.encode(cid)
{:ok, "zb2rhbYzyUJP6euwn89vAstfgG2Au9BSwkFGUJkbujWztZWjZ"}
Same as IPFS again.
I would say that this is working reliably now.
Will test with https://github.com/dwyl/cid/pull/19 when @SimonLab is ready
@nelsonic
Do you think that the following points have been covered...
Write comprehensive doctests that demonstrate that the code works as expected.
Create beginner-friendly examples. (we can split this out into separate repos later!)
If so can you check them off in the acceptance criteria please?
Going to work on the following points from the acceptance criteria...
estimate t25m.
@RobStallion doctests are good. beginner-friendly example: https://github.com/dwyl/phoenix-ecto-append-only-log-example/issues/22 please proceed. Thanks!
This issue/epic is dedicated exclusively to How i.e. implementation
Todo
[ ] Read the JavaScript Implementation of CID: https://github.com/multiformats/js-cid to understand how it works. If you have questions, please ask them in: https://github.com/dwyl/learn-ipfs/issues
localhost and try to see if re-ordering elements in an Object or nested Array produces a different CID.
[x] Read the "not working" Elixir version: https://github.com/nocursor/ex-cid
[ ] Implement an "offline" version of CID in Elixir that produces the exact same CID as the JS version.
cid of aString should always be the same for a given string.
cid of aMap should work regardless of the order of content.
MVP CID v1
for our MVP we only need a sha2-256 hash in Base58BTC which is URL-safe. For this we can use the code from https://github.com/multiformats/ex_multihash (which is maintained) and https://github.com/nocursor/b58 (which is unresponsive) respectively.
doctests that demonstrate that the code works as expected.
Relevant Reading
Help Wanted
We really need help on getting this package built, documented and shipped so we can move forward with our "stack" https://github.com/dwyl/technology-stack/issues/67 and "roadmap" https://github.com/dwyl/product-roadmap If you have the curiosity, energy and time to help, please comment below! (Thanks!)