RobStallion commented 5 years ago

@nelsonic just opening this issue as a place to ask some questions I have at the moment. Can split each question into their own issue if needed or add them to an FAQ section in the readme if you feel that would be helpful.

Questions

1.

In issue #1 you mention shortening the URL from:

location-app.com/venues/123e4567-e89b-12d3-a456-426655440000

to

location-app.com/sw1x

How would we ensure that this URL is always unique if we had millions of URLs, like YouTube for example? 4 characters does not seem long enough for that to be possible.

2.

In issue #1 you mention:

I suggest that by using a hash of the content as the Primary Key, Ecto (or PostgreSQL) would "reject" the insert request as a "duplicate" and we would not waste space in the database/table with dupe data.

My understanding of the above is that we would take all the content from the form submission and hash it, generating a string which would be used as the primary key. The same content would generate the same string, so we could easily tell if it existed or not. I think this is similar to how hash tables work under the hood (thanks for suggesting that book btw 😉). If my understanding of this is correct then my questions are...

How would we link the change to the original? Would we still be using an alog approach where we have an :entry_id to keep track of this?

Would the long-term plan be to just track the changes and link to the original file? (I believe this is similar to how both git and IPFS work.)

nelsonic commented 5 years ago

Hi @RobStallion, thanks for posting these good questions as an issue. 🎉

In future, consider opening a separate issue for each question for clarity 💭 and making the question the title of the issue, both for SEO benefits 🔍 and to make life easier for the "next dev" when they have similar questions ... 😉

Answers

1. Uniqueness?

cid will be universally unique.

That needs to be clear to everyone from the first line of the readme. (if it's not, then it's "my bad" and I need to "fix" it ...)

We will be using the SHA256 hash, for which no collision has been found to date. 256-bit hashes are used by most cryptocurrencies. Enough "smart people" have done the homework on this for us not to worry about it.

The "math" is covered in: https://crypto.stackexchange.com/questions/39641/what-are-the-odds-of-collisions-for-a-hash-function-with-256-bit-output

This is the best video on 256 bit hash collision probability: https://youtu.be/S9JGmA5_unY

This video does not cover the "Birthday Paradox"; see: https://github.com/nelsonic/nelsonic.github.io/issues/576 But again, for the purposes of this answer, and indeed any project we are likely to work on in our lifetime, the chance of a "birthday attack" creating a collision between 256-bit hashes is "ignorable".
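To put a rough number on "ignorable", here is a minimal sketch of the standard birthday-bound approximation. The function name and the example counts are illustrative assumptions, not something from the cid codebase:

```python
import math


def collision_probability(n_items: int, bits: int = 256) -> float:
    """Approximate chance that at least two of n_items random hashes collide.

    Uses the birthday-bound approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = -n_items * (n_items - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)


# A trillion records hashed to 256 bits: the probability is vanishingly small.
print(collision_probability(10**12))

# By contrast, 1000 records in a 16-bit space will almost certainly collide,
# which is why a very short identifier cannot be unique on its own.
print(collision_probability(1000, bits=16))
```

The second call speaks to the concern in question 1: a very short string derived from the hash alone is not unique, so short URLs need something else to guarantee uniqueness.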

`cid` means we have the `<option>` to store content on IPFS

We need to make it clear that using a cid as the unique identifier for a record means we can optionally store content on IPFS for redundancy/decentralisation. But for the purposes of building our Apps in 2019, we are NOT going to even try to build "D-Apps". Unfortunately, there is no way of maintaining privacy for private/personal content on IPFS without pre-storage encryption, which automatically implies storing encryption/decryption keys somewhere centrally. i.e. something will need to be stored centrally, so we might as well store the data centrally to reduce query time and request latency in any App(s) we build.

Note: the reason I haven't previously "proclaimed" in https://github.com/dwyl/technology-stack/issues/67 that "all our apps will be distributed by default from now on", is because the Application building "story" is incomplete on IPFS/IPLD.

There is no way of deleting old data that people no longer want to exist: https://github.com/ipfs/faq/issues/9 This means that if someone says something hurtful or untrue, they cannot "retract" it to stop it perpetuating on the network ... so decentralisation and content replication can be harmful! One of the original principles of IPFS was the "permanent web". I'm fairly certain that most users will not like the idea of "losing control" over their data, and indeed this is incompatible with EU law: https://en.wikipedia.org/wiki/Right_to_be_forgotten

So we are going to use cid as a means of ensuring uniqueness in our DB records, and we will use the concept of prev for versioning. See: answers 1 and 2 below. But we are not going to store textual data on IPFS for the foreseeable future, until "Filecoin" is fully operational and we have a guarantee that our data will not disappear.

We can still use cid 100% independently of IPFS, and when the ecosystem "matures" we can offer users of our application(s) (Time, Tudo, ALT, etc) the option to "backup" their data to IPFS! For now, ignore the existence of IPFS and focus purely on using cid to replace entry_id in Alog.

Uniqueness in a Phoenix-based Web Application

In a given web app, there will be a PostgreSQL database that will store the data. Each item of content will have a cid.

Imagine that we are building a "home rental" website, "restful-bed-and-healthy-breafast.com", which has the short domain bnb.com. Here is an example (simplified) "homes" table:

| `inserted` | `cid`(PK)¹ | `address` | `slug` | `prev` |
| --- | --- | --- | --- | --- |
| 1541609554 | hdyk80sgPeAX | Wayne Manor, 1007 Mountain Drive, Gotham | hdy | null |
| 1541618643 | HvTlGsEX88Nc | Wayne Manor, 1007 Mountain Drive, Gotham | waynemanor | hdyk80sgPeAX |
| 1541628987 | pN7hWNuqJ6J | Wayne Manor, 1007 Mountain Drive, Gotham, USA | waynemanor | HvTlGsEX88Nc |

The first row is the "creation" of the entry for "Wayne Manor". At this point the URL would be: bnb.com/hdy corresponding to the first 3 letters of the cid.

The second row is when the listing owner updates the slug to the friendlier waynemanor, so the URL is more memorable for humans and more SEO friendly: bnb.com/waynemanor. The URL may be longer, but it's more memorable and thus people may prefer it.

Notice how the value of prev refers to the cid of the previous version of the record? That's how we do versioning in a cid-based web app. (see below)

As this data will be stored "centrally" by a PostgreSQL database, the DB can be responsible for ensuring that the slug field is not duplicated. We will need to run a "SELECT" query before inserting any record that has a slug, to confirm that the user inserting the data has the access rights to update the row with that slug. We will clarify those "access control" details later. For now, let's stick with the simplified version.

In the third row, we added the "USA" to the address which changed the content and thus creates a new cid. The prev refers to the previous version of the record (before "USA" was added). The slug has not changed, so the URL is still the same: bnb.com/waynemanor
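To illustrate the three rows above, here is a rough sketch of content-derived primary keys with a prev link. It makes several assumptions that are not from this thread: SQLite stands in for PostgreSQL, a hex SHA-256 digest of the JSON-encoded content stands in for a real cid, only address and slug are hashed, and the first version has no slug. The names make_cid and insert_version are hypothetical.

```python
import hashlib
import json
import sqlite3
import time


def make_cid(content):
    # Stand-in for a real cid (assumption): hex SHA-256 of the canonical JSON.
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def insert_version(db, address, slug=None, prev=None):
    # The primary key is derived from the content, so identical content
    # always maps to the same key.
    cid = make_cid({"address": address, "slug": slug})
    db.execute(
        "INSERT INTO homes (inserted, cid, address, slug, prev) VALUES (?, ?, ?, ?, ?)",
        (int(time.time()), cid, address, slug, prev),
    )
    return cid


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE homes (inserted INTEGER, cid TEXT PRIMARY KEY, "
    "address TEXT, slug TEXT, prev TEXT)"
)

# Three versions of the same listing, each pointing at the one before it.
v1 = insert_version(db, "Wayne Manor, 1007 Mountain Drive, Gotham")
v2 = insert_version(db, "Wayne Manor, 1007 Mountain Drive, Gotham", "waynemanor", prev=v1)
v3 = insert_version(db, "Wayne Manor, 1007 Mountain Drive, Gotham, USA", "waynemanor", prev=v2)

# Re-submitting identical content produces the same key, so the DB rejects it.
try:
    insert_version(db, "Wayne Manor, 1007 Mountain Drive, Gotham, USA", "waynemanor", prev=v2)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = db.execute("SELECT COUNT(*) FROM homes").fetchone()[0]
print(duplicate_rejected, row_count)  # True 3
```

Whether prev itself should be part of the hashed content is left open in this sketch; as written, reverting a record to exactly its earlier content would be rejected as a duplicate.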

2. Updating Content

The updated version of content would be linked to the previous version using a prev field, the way it happens in IPFS, Ethereum and Bitcoin (so it will be familiar to people), i.e. prev: previous_cid. Address example:

| `inserted` | `cid`(PK)¹ | `name` | `address` | `prev` |
| --- | --- | --- | --- | --- |
| 1541609554 | gVSTedHFGBetxy | Bruce Wane | 1007 Mountain Drive, Gotham | null |
| 1541618643 | smnELuCmEaX42 | Bruce Wane | Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil | gVSTedHFGBetxy |

When a row does not have a prev value, we know it is the first time that content has been inserted into the database. When a prev value is defined in a row, we know this is a new version of previously inserted content and we can "traverse the tree" to see all previous versions.
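As a minimal sketch of that traversal, the snippet below walks the prev links from the newest version back to the first one. It assumes the rows have already been loaded into a dict keyed by cid; the history function is hypothetical and the data is copied from the (truncated) example table.

```python
def history(records, cid):
    """Return every version of a record, newest first, by following prev links."""
    chain = []
    seen = set()
    current = cid
    while current is not None:
        if current in seen:
            # Defensive check: a well-formed chain never loops back on itself.
            raise ValueError("cycle detected in prev links")
        seen.add(current)
        record = records[current]
        chain.append(record)
        current = record["prev"]
    return chain


records = {
    "gVSTedHFGBetxy": {
        "cid": "gVSTedHFGBetxy",
        "address": "1007 Mountain Drive, Gotham",
        "prev": None,
    },
    "smnELuCmEaX42": {
        "cid": "smnELuCmEaX42",
        "address": "Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil",
        "prev": "gVSTedHFGBetxy",
    },
}

versions = history(records, "smnELuCmEaX42")
print([v["cid"] for v in versions])  # ['smnELuCmEaX42', 'gVSTedHFGBetxy']
```

In a real app the same walk could be done in the database instead of in memory, but that is beyond what this thread specifies.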

¹: all cid values truncated for brevity.

@RobStallion please let me know if this answers your questions. 🤔 If not, please help identify the remaining confusion. thanks. 👍

RobStallion commented 5 years ago

@nelsonic Those are amazing thank you. Super super helpful. 👍

nelsonic commented 5 years ago

@RobStallion do you want to convert these questions & answers into "FAQ.md" and create a PR? 😉

RobStallion commented 5 years ago

@nelsonic Will do 👍

RobStallion commented 5 years ago

The following lines, added to the readme in https://github.com/dwyl/cid/pull/16, answer my first question...

The reason we can abbreviate the URL to just gV is because our SHORT URL service has a centralised Database/store. If we wanted to run a decentralised content addressing system, we would simply link to the full cid: dwyl.co/gVSTedHFGBetxyYib9mBQsjtZj4dJjQe
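For anyone wondering how a centralised store makes the abbreviation safe, here is a rough sketch. It assumes the service can see every stored cid and hands out the shortest prefix that no other cid shares. The function names are hypothetical, and the second cid in the example is made up for illustration.

```python
def shortest_unique_prefix(cid, existing, min_len=2):
    """Return the shortest prefix of cid that no other stored cid starts with."""
    for length in range(min_len, len(cid) + 1):
        prefix = cid[:length]
        if not any(other != cid and other.startswith(prefix) for other in existing):
            return prefix
    return cid


def resolve(prefix, existing):
    """Map a short prefix back to the single full cid it identifies."""
    matches = [cid for cid in existing if cid.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"prefix {prefix!r} matches {len(matches)} cids")
    return matches[0]


full_cid = "gVSTedHFGBetxyYib9mBQsjtZj4dJjQe"  # the cid quoted from the readme
other_cid = "gVTqX9mKpL2sWzRb4nYcHd7uFj3eAa1o"  # made-up cid sharing the "gV" prefix

print(shortest_unique_prefix(full_cid, [full_cid]))             # gV
print(shortest_unique_prefix(full_cid, [full_cid, other_cid]))  # gVS
print(resolve("gVS", [full_cid, other_cid]) == full_cid)        # True
```

A prefix handed out earlier can become ambiguous once more cids are stored, so a real service would presumably persist the short code it issued rather than recompute it on every request.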

RobStallion commented 5 years ago

Closing as @nelsonic has answered my questions and they have been added to the readme.

dwyl / cid

Questions #10
