Closed RobStallion closed 5 years ago
Hi @RobStallion, thanks for posting these good questions as an issue. 🎉
In future, consider opening a separate issue for each question for clarity 💭 and making the question the title of the issue for SEO benefits 🔍; that also makes life easier for the "next dev" who has a similar question ... 😉
`cid` will be universally unique.
That needs to be clear to everyone from the first line of the readme. (if it's not, then it's "my bad" and I need to "fix" it ...)
We will be using the SHA-256 hash, for which no collision has been found to date. 256-bit hashes are used by most cryptocurrencies. Enough "smart people" have done the homework on this for us not to worry about it.
The "math" is covered in: https://crypto.stackexchange.com/questions/39641/what-are-the-odds-of-collisions-for-a-hash-function-with-256-bit-output
This is the best video on 256 bit hash collision probability: https://youtu.be/S9JGmA5_unY
This video does not cover the "Birthday Paradox"; see: https://github.com/nelsonic/nelsonic.github.io/issues/576 But again, for the purposes of this answer and indeed any project we are likely to work on in our lifetime, when dealing with 256-bit hashes, the chance of a "birthday attack" creating a collision is "ignorable".
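To put a rough number on "ignorable", here is a minimal sketch of the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2N), with N = 2^256 possible hash values. The function name and the record counts in the loop are illustrative assumptions, not figures from this thread.

```python
import math


def collision_probability(n_items: int, hash_bits: int = 256) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Uses the birthday-bound approximation p ~= 1 - exp(-n(n-1) / (2N)),
    where N = 2**hash_bits is the number of possible hash values.
    """
    space = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    # Hypothetical record counts, from a million up to a billion billion.
    for n in (10**6, 10**12, 10**18):
        print(f"{n:.0e} records -> p ~= {collision_probability(n):.3e}")
```

Even for the largest count in the loop the result stays far below anything an application would need to plan for, which is the point being made above.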
`cid` means we have the *option* to store content on IPFS
We need to make it clear that using a `cid` as the unique identifier for a record means we can optionally store content on IPFS for redundancy/decentralisation, but for the purposes of building our Apps in 2019, we are NOT going to even try to build "D-Apps". Unfortunately there is no way of maintaining privacy for private/personal content on IPFS without pre-storage encryption, which automatically implies storing encryption/decryption keys somewhere centrally.
i.e. something will need to be stored centrally, so we might as well store the data centrally to reduce query time and request latency in any App(s) we build.
Note: the reason I haven't previously "proclaimed" in https://github.com/dwyl/technology-stack/issues/67 that "all our apps will be distributed by default from now on" is that the Application building "story" is incomplete on IPFS/IPLD.
There is no way of deleting old data that people no longer want to exist: https://github.com/ipfs/faq/issues/9 This means that if someone says something hurtful or untrue, they cannot "retract" it to stop it perpetuating on the network ... so decentralisation and content replication can be harmful! One of the original principles of IPFS was the "permanent web". I'm fairly certain that most users will not like the idea of "losing control" over their data, and indeed this is incompatible with EU law: https://en.wikipedia.org/wiki/Right_to_be_forgotten
So we are going to use `cid` as a means of ensuring uniqueness in our DB records, and we will use the concept of `prev` for versioning. See: answers 1 and 2 below. But we are not going to store textual data on IPFS for the foreseeable future, until "Filecoin" is fully operational and we have a guarantee that our data will not disappear.
We can still use `cid` 100% independently of IPFS and when the ecosystem "matures", we can offer users of our application(s) (Time, Tudo, ALT, etc) the option to "backup" their data to IPFS! For now, ignore the existence of IPFS and focus purely on using `cid` to replace `entry_id` in Alog.
In a given web app, there will be a PostgreSQL database that will store the data.
Each item of content will have a `cid`.
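As a minimal sketch of the idea that identical content always yields the same identifier, the snippet below hashes a record with SHA-256. The helper name `content_id`, the JSON canonicalisation and the URL-safe Base64 encoding are assumptions made for illustration; the actual `cid` package may encode the digest differently (IPFS CIDs use their own multihash/multibase format).

```python
import base64
import hashlib
import json


def content_id(record: dict) -> str:
    """Return a deterministic identifier derived from the record's content."""
    # Sort keys so that logically equal records serialise identically.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # URL-safe Base64 without padding keeps the id usable in a URL path.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    home = {"address": "Wayne Manor, 1007 Mountain Drive, Gotham"}
    print(content_id(home))
    # Changing the content changes the identifier.
    print(content_id({"address": home["address"] + ", USA"}))
```

Because the identifier depends only on the content, inserting the same record twice produces the same key, which is how a database can tell that the content already exists.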
Imagine that we are building a "home rental" website, "restful-bed-and-healthy-breafast.com", which has the short domain bnb.com. Here is an example (simplified) "homes" table:
inserted | cid (PK)<sup>1</sup> | address | slug | prev |
---|---|---|---|---|
1541609554 | hdyk80sgPeAX | Wayne Manor, 1007 Mountain Drive, Gotham | hdy | null |
1541618643 | HvTlGsEX88Nc | Wayne Manor, 1007 Mountain Drive, Gotham | waynemanor | hdyk80sgPeAX |
1541628987 | pN7hWNuqJ6J | Wayne Manor, 1007 Mountain Drive, Gotham, USA | waynemanor | HvTlGsEX88Nc |
The first row is the "creation" of the entry for "Wayne Manor".
At this point the URL would be: bnb.com/hdy corresponding to the first 3 letters of the `cid`.
The second row is when the listing owner updates the `slug` to be a more friendly `waynemanor` so the URL is more human memorable and SEO friendly: bnb.com/waynemanor
The URL may be longer but it's more memorable and thus people may prefer it.
Notice how the value of `prev` refers to the `cid` of the previous version of the record? That's how we do versioning in a `cid` based web app. (see below)
As this data will be stored "centrally" by a PostgreSQL database, the DB can be responsible for ensuring that the `slug` field is not duplicated. We will need to run a "SELECT" query before inserting any record that has a `slug` to confirm that the user inserting the data has the access rights to update the row with that `slug`, but we will clarify those "access control" details later. For now, let's stick with the simplified version.
In the third row, we added "USA" to the address, which changed the content and thus created a new `cid`. The `prev` refers to the previous version of the record (before "USA" was added). The `slug` has not changed, so the URL is still the same: bnb.com/waynemanor
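The three "homes" rows can be sketched as an append-only table in which a URL slug resolves to the most recently inserted version. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the helper name `latest_for_slug` is hypothetical. The `slug` column is deliberately not declared UNIQUE here, because successive versions of the same listing share a slug.

```python
import sqlite3

# In-memory SQLite database standing in for the central PostgreSQL store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE homes (
        inserted INTEGER NOT NULL,
        cid      TEXT PRIMARY KEY,
        address  TEXT NOT NULL,
        slug     TEXT NOT NULL,
        prev     TEXT REFERENCES homes (cid)
    )
    """
)

# Rows copied from the example table; every update is a new row, never an UPDATE.
rows = [
    (1541609554, "hdyk80sgPeAX", "Wayne Manor, 1007 Mountain Drive, Gotham", "hdy", None),
    (1541618643, "HvTlGsEX88Nc", "Wayne Manor, 1007 Mountain Drive, Gotham", "waynemanor", "hdyk80sgPeAX"),
    (1541628987, "pN7hWNuqJ6J", "Wayne Manor, 1007 Mountain Drive, Gotham, USA", "waynemanor", "HvTlGsEX88Nc"),
]
conn.executemany("INSERT INTO homes VALUES (?, ?, ?, ?, ?)", rows)


def latest_for_slug(slug: str):
    """Return the newest (cid, address) stored under the slug, or None."""
    return conn.execute(
        "SELECT cid, address FROM homes WHERE slug = ? "
        "ORDER BY inserted DESC LIMIT 1",
        (slug,),
    ).fetchone()


if __name__ == "__main__":
    print(latest_for_slug("waynemanor"))
```

With this layout, bnb.com/waynemanor keeps working after the address edit because the lookup always picks the latest row for that slug.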
The updated version of content would be linked to the previous version using a `prev` field, the way it happens in IPFS, Ethereum and Bitcoin (so it will be familiar to people):
`prev: previous_cid`
address example:
inserted | cid (PK)<sup>1</sup> | name | address | prev |
---|---|---|---|---|
1541609554 | gVSTedHFGBetxy | Bruce Wane | 1007 Mountain Drive, Gotham | null |
1541618643 | smnELuCmEaX42 | Bruce Wane | Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil | gVSTedHFGBetxy |
When a row does not have a `prev` value then we know it is the first time that content has been inserted into the database. When a `prev` value is defined in a row, we know this is a new version of previously inserted content and we can "traverse the tree" to see all previous versions.
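"Traversing the tree" amounts to following `prev` links until a row with no `prev` is reached. Here is a minimal sketch using the two rows from the address example; the in-memory dictionary and the function name `history` are assumptions for illustration, and in a real app each step would be a database lookup.

```python
from typing import Dict, List, Optional, Tuple

# cid -> (address, prev), copied from the address example table.
RECORDS: Dict[str, Tuple[str, Optional[str]]] = {
    "gVSTedHFGBetxy": ("1007 Mountain Drive, Gotham", None),
    "smnELuCmEaX42": (
        "Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil",
        "gVSTedHFGBetxy",
    ),
}


def history(cid: str, records: Dict[str, Tuple[str, Optional[str]]] = RECORDS) -> List[str]:
    """Return the chain of cids from the given version back to the first one."""
    chain: List[str] = []
    current: Optional[str] = cid
    while current is not None:
        chain.append(current)
        # The second tuple element is the prev cid; None marks the first version.
        current = records[current][1]
    return chain


if __name__ == "__main__":
    for version in history("smnELuCmEaX42"):
        print(version, "->", RECORDS[version][0])
```

The list is ordered newest first, so the last element is always the original insertion.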
1: all `cid` values truncated for brevity.
@RobStallion please let me know if this answers your questions. 🤔 If not, please help identify the remaining confusion. thanks. 👍
@nelsonic Those are amazing thank you. Super super helpful. 👍
@RobStallion do you want to convert these questions & answers into "FAQ.md" and create a PR? 😉
@nelsonic Will do 👍
The following lines added to the readme in https://github.com/dwyl/cid/pull/16 answer my first question...
The reason we can abbreviate the URL to just gV is because our SHORT URL service has a centralised Database/store. If we wanted to run a decentralised content addressing system, we would simply link to the full cid: dwyl.co/gVSTedHFGBetxyYib9mBQsjtZj4dJjQe
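One way a centralised store could hand out abbreviated URLs is to pick the shortest prefix of the `cid` that no earlier record has claimed, growing the prefix only when there is a clash. This is a sketch under that assumption, not a description of how the dwyl short URL service is implemented; the function name, the `taken` set and the second `cid` in the usage example are hypothetical.

```python
from typing import Set


def shortest_unique_prefix(cid: str, taken: Set[str], min_length: int = 2) -> str:
    """Return the shortest prefix of cid (at least min_length chars) not in taken."""
    for length in range(min_length, len(cid) + 1):
        prefix = cid[:length]
        if prefix not in taken:
            return prefix
    raise ValueError("every prefix of this cid is already in use")


if __name__ == "__main__":
    taken: Set[str] = set()

    first = shortest_unique_prefix("gVSTedHFGBetxyYib9mBQsjtZj4dJjQe", taken)
    taken.add(first)
    print(first)  # the two-character abbreviation from the readme quote

    # Hypothetical second cid that happens to start with the same two characters.
    second = shortest_unique_prefix("gVabcdefghijklmnopqrstuvwxyz0123", taken)
    taken.add(second)
    print(second)
```

The central store is what makes this safe: it is the only place that knows which prefixes are taken, which is why a decentralised system would have to link to the full `cid` instead.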
Closing as @nelsonic has answered my questions and they have been added to the readme.
@nelsonic just opening this issue as a place to ask some questions I have at the moment. Can split each question into their own issue if needed or add them to an FAQ section in the readme if you feel that would be helpful.
Questions
1.
In issue #1 you mention shortening the URL from:
to
How would we ensure that this URL is always unique if we had millions of URLs, like YouTube for example? 4 characters does not seem long enough for that to be possible.
2.
In issue #1 you mention:
My understanding of the above is that we would take all the content from the form submission and hash it generating a string (which would be used as the primary key). The same content would generate the same string so we could easily tell if it existed or not (I think this is similar to how hash tables work under the hood (thanks for suggesting that book btw 😉)). If my understanding of this is correct then my questions are...
How would we link the change to the original? Would we still be using an `alog` approach where we have an `entry_id` to keep track of this?
Would the long term plan be to just track the changes, and link to the original file? (I believe this is similar to how both git and IPFS work)