attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0

Reintroduce IPFS chunkstore #3799

Closed: ORBAT closed this 5 years ago

ORBAT commented 6 years ago

This PR reintroduces the IPFS chunkstore. It also makes spec's external protocol registration thread-safe.
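As a rough illustration of what thread-safe registration can look like, here is a minimal sketch that guards a protocol map with a `sync.RWMutex`. The `Protocol`, `registry`, `Register`, and `Lookup` names are hypothetical and are not taken from the spec package; this is not the code in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Protocol is a hypothetical stand-in for an external protocol handler.
type Protocol interface {
	Name() string
}

// namedProtocol is a trivial Protocol used only for this example.
type namedProtocol string

func (p namedProtocol) Name() string { return string(p) }

// registry guards its map with a RWMutex so that registration and lookup
// are safe to call from multiple goroutines.
type registry struct {
	mu        sync.RWMutex
	protocols map[string]Protocol
}

func newRegistry() *registry {
	return &registry{protocols: map[string]Protocol{}}
}

// Register takes the write lock, so it excludes readers and other writers.
func (r *registry) Register(name string, p Protocol) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.protocols[name] = p
}

// Lookup takes the read lock, so concurrent lookups do not block each other.
func (r *registry) Lookup(name string) (Protocol, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.protocols[name]
	return p, ok
}

func main() {
	r := newRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("proto%d", i)
			r.Register(name, namedProtocol(name))
		}(i)
	}
	wg.Wait()
	_, ok := r.Lookup("proto3")
	fmt.Println("proto3 registered:", ok)
}
```

Without the lock, concurrent writes to a Go map are a data race and can crash the program, which is why registration from multiple goroutines needs some form of synchronization.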

Commit 217ba38b by @aboodman stated that

I think the right way to handle IPFS is to have it outside the tree, so I'm hoping to work on refactoring the code out of our repo entirely.

but as it currently stands, cmd/noms can't interact with IPFS stores unless it's in the main noms repo.

Some notes regarding this PR:

aboodman commented 6 years ago

@ORBAT - I'm really sorry about the slow reply here. I completely missed the notification about this PR due to email config. Should be fixed now.

as it currently stands, cmd/noms can't interact with IPFS stores unless it's in the main noms repo.

eep. good point.

The main reason to pull out the IPFS code was to work around some dependency issues we have experienced with it (e.g., https://github.com/attic-labs/noms/issues/3786).

I'm not sure what the right solution to that is, but on further reflection, if people are going to be actively working with the IPFS support (e.g., you), then we might as well have the code in this tree for easier maintainability rather than keeping it separate.

So I support moving this back in.

That said, I don't think that it should use the 'external protocol' concept if it is linked into the binary.

ORBAT commented 6 years ago

Hi, thanks for the reply and the feedback; no worries about how long it took.

I'll check the current IPFS dependencies and see if #3786 is still an issue. Getting rid of gx dependencies may or may not be an option, dunno yet.

ORBAT commented 6 years ago

Hi @aboodman, I've got a couple of questions (with likely more incoming since I'm trying to learn the ropes of noms internals, so please bear with me.)

Blockstore / Blocks

What was the reasoning behind having ipfs-local use Blockstore instead of creating an offline node with core.BuildCfg{Online: false}? Was the intention to have the node always be online so the pubsub subsystem will still work regardless of whether the user wants the actual blocks to be exposed to the wider IPFS network?

Do blocks stored directly via the Blockstore API never get exposed to the network? To me it seems like the BlockService always gets initialized with a reference to the Blockstore, and the Exchange is networked if cfg.Online is true. So if I've understood this correctly, adding blocks via Blockstore does expose them to the network; it's just that the current IPFS chunkstore code limits its ability to get blocks from the network by only using Blockstore in "local mode."

The reason I'm asking this is that it'd be neat to have just one protocol, ipfs, for both the local and networked use cases, with the configuration defining whether to enable networking or not. As far as I can tell, this'd remove the need for the different code paths for local vs networked chunk stores (if cs.c.local { ... } else { ... }). Of course, having a non-networked node would mean the pubsub system wouldn't work, so that'd require some refactoring of the IPFS example.
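To make the idea concrete, here is a minimal sketch of a single store type whose networking behaviour is chosen once from configuration, so that `Get` has only one code path. Everything here (`Config`, `fetcher`, `offlineFetcher`, `networkFetcher`, `chunkStore`) is hypothetical, and a plain map stands in for the remote network; none of it is the IPFS or Noms API.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("block not found")

// fetcher is a hypothetical abstraction over where missing blocks come from.
type fetcher interface {
	Fetch(key string) ([]byte, error)
}

// offlineFetcher never leaves the local machine.
type offlineFetcher struct{}

func (offlineFetcher) Fetch(string) ([]byte, error) { return nil, errNotFound }

// networkFetcher stands in for a networked exchange; a map plays the network.
type networkFetcher struct{ remote map[string][]byte }

func (n networkFetcher) Fetch(key string) ([]byte, error) {
	if b, ok := n.remote[key]; ok {
		return b, nil
	}
	return nil, errNotFound
}

// Config is a hypothetical configuration with a single networking switch.
type Config struct{ Networked bool }

type chunkStore struct {
	local map[string][]byte
	fetch fetcher
}

// newChunkStore picks the fetcher once, so no later code branches on mode.
func newChunkStore(cfg Config, remote map[string][]byte) *chunkStore {
	var f fetcher = offlineFetcher{}
	if cfg.Networked {
		f = networkFetcher{remote: remote}
	}
	return &chunkStore{local: map[string][]byte{}, fetch: f}
}

// Get checks local storage first, then asks the configured fetcher.
func (cs *chunkStore) Get(key string) ([]byte, error) {
	if b, ok := cs.local[key]; ok {
		return b, nil
	}
	b, err := cs.fetch.Fetch(key)
	if err != nil {
		return nil, err
	}
	cs.local[key] = b // cache the fetched block locally
	return b, nil
}

func main() {
	remote := map[string][]byte{"k1": []byte("remote block")}

	offline := newChunkStore(Config{Networked: false}, remote)
	_, err := offline.Get("k1")
	fmt.Println("offline error:", err)

	online := newChunkStore(Config{Networked: true}, remote)
	b, _ := online.Get("k1")
	fmt.Println("online result:", string(b))
}
```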

Concurrency

What was the reasoning behind limiting concurrency to 1 in the chat example? Is it mainly to limit the number of concurrent readers/writers of the root hash file? Or is the Blockstore not thread-safe? I haven't looked into that yet, and IPFS's docs didn't seem to mention anything about thread-safety at a quick glance.

Chunkstore roots / Rebase

I'm a bit unclear on the concept of root and Rebase.

I sort of understand root's meaning for datasets and databases, but what's its meaning in the chunkstore context? The database root is Map<String, Ref<Commit>> pointing to datasets, but if I've understood correctly, multiple databases can be backed by the same chunkstore. Does a chunkstore root point to 1+ database roots?

What is Rebase used for, in both the database and chunkstore contexts? As in, in what sort of situations would one need to do a rebase? I've seen it used in e.g. the memory stores with views, which is understandable (brings the view in line with the "top level" memory store), but I'm not clear on its use in other cases.

aboodman commented 6 years ago

@ORBAT - I'm so sorry (again).

I'm not sure why I keep missing these mails. If I'm unresponsive, can you please just ping the thread again? I've been trying to spend more time on Noms the last few weeks, and I'd love to have contributions. I think right now the traffic is just low enough that it's easy to miss in my inbox.

You can also just shake my tree at any other channel (e.g., aboodman on Twitter). But hopefully now that I'm spending time on this again, it won't continue to happen.

Regarding your questions:

more incoming since I'm trying to learn the ropes of noms internals

Awesome.

What was the reasoning behind having ipfs-local use Blockstore instead of creating an offline node with core.BuildCfg{Online: false}? Was the intention to have the node always be online so the pubsub subsystem will still work regardless of whether the user wants the actual blocks to be exposed to the wider IPFS network?

It's hard to remember, and it's possible we didn't even explore the Online flag. But it's true that we would have wanted the pubsub system to work in both modes.

The ipfs/ipfs-local distinction was a bit of a quick hack, and with more time I hope we could do something better. IIRC (and my memory is very foggy) the issue was that Noms has a deep assumption that a DAG is always complete, meaning that if a particular chunk is present, then all chunks reachable from it are also present.

Because of the magic of the IPFS network, with ipfs we would frequently end up with chunks present locally but not their children. This was bad news for Noms.
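The completeness assumption can be illustrated with a small sketch that walks everything reachable from a root and reports referenced chunks that are absent. The `store` type and `missingFrom` function are hypothetical simplifications: hashes are plain strings and each chunk is reduced to the list of hashes it references, whereas real chunks are bytes that have to be decoded.

```go
package main

import "fmt"

// store maps a chunk's hash to the hashes of the chunks it references.
// A hash with no entry in the map is a chunk that is not present locally.
type store map[string][]string

// missingFrom walks every chunk reachable from root and returns the hashes
// that are referenced but absent. An empty result means the DAG is complete.
func missingFrom(s store, root string) []string {
	var missing []string
	seen := map[string]bool{}
	stack := []string{root}
	for len(stack) > 0 {
		h := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[h] {
			continue
		}
		seen[h] = true
		children, ok := s[h]
		if !ok {
			missing = append(missing, h)
			continue
		}
		stack = append(stack, children...)
	}
	return missing
}

func main() {
	s := store{
		"root": {"a", "b"},
		"a":    {},
		// "b" is referenced by root but was never fetched.
	}
	fmt.Println(missingFrom(s, "root")) // prints [b]
}
```

A store where this check can return a non-empty list for a chunk that is already present is exactly the situation described above: the parent arrived, its children did not.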

I don't think we ever really got the ipfs scheme working well.

Do blocks stored directly via the Blockstore API never get exposed to the network? To me it seems like the BlockService always gets initialized with a reference to the Blockstore, and the Exchange is networked if cfg.Online is true. So if I've understood this correctly, adding blocks via Blockstore does expose them to the network; it's just that the current IPFS chunkstore code limits its ability to get blocks from the network by only using Blockstore in "local mode."

The latter was the goal IIRC. Perhaps the special case in the put path is not necessary!

The reason I'm asking this is that it'd be neat to have just one protocol, ipfs, for both the local and networked use cases, with the configuration defining whether to enable networking or not. As far as I can tell, this'd remove the need for the different code paths for local vs networked chunk stores (if cs.c.local { ... } else { ... }). Of course, having a non-networked node would mean the pubsub system wouldn't work, so that'd require some refactoring of the IPFS example.

I actually came to the conclusion over time that the ipfs path was not very useful. If I were doing this in a principled way I would just use pubsub and libp2p directly, and then have them talk to a normal nbs database.

Concurrency

What was the reasoning behind limiting concurrency to 1 in the chat example? Is it mainly to limit the number of concurrent readers/writers of the root hash file? Or is the Blockstore not thread-safe? I haven't looked into that yet, and IPFS's docs didn't seem to mention anything about thread-safety at a quick glance.

No idea! Can you give me a pointer to the code in question?

Chunkstore roots / Rebase

I'm a bit unclear on the concept of root and Rebase.

I'm so sad I didn't catch this question faster. Cool to have someone digging in.

I sort of understand root's meaning for datasets and databases, but what's its meaning in the chunkstore context? The database root is Map<String, Ref<Commit>> pointing to datasets, but if I've understood correctly, multiple databases can be backed by the same chunkstore. Does a chunkstore root point to 1+ database roots?

It's not the case that multiple databases can be backed by one ChunkStore.

The difference between ChunkStore and Database is basically that Database knows about types.Value and ChunkStore doesn't.

So when you write something to Database the thing you write is a types.Value. Database does all kinds of validation to ensure the system at the end of the day still looks like a valid database. It makes sure that when you commit a value, all the values it depends on are present.

ChunkStore doesn't do any of that. You can, erm, put chunks in it, and get them out. The only other thing of significance that you can do with it is read/write to a single unnamed mutable register. Writes to this register have certain concurrency guarantees on which the entire system depends.
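A minimal sketch of that shape, assuming an in-memory store: content-addressed put/get plus one mutable root register that is only updated by compare-and-swap. The `memStore` type and its method names are hypothetical and deliberately simplified (SHA-256 hex strings are used as hashes purely for the example); this is not the actual Noms ChunkStore interface.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// memStore is a hypothetical in-memory store: chunks keyed by content hash,
// plus a single unnamed mutable register (root).
type memStore struct {
	mu     sync.Mutex
	chunks map[string][]byte
	root   string
}

func newMemStore() *memStore {
	return &memStore{chunks: map[string][]byte{}}
}

// Put stores a copy of data under its content hash and returns the hash.
func (s *memStore) Put(data []byte) string {
	sum := sha256.Sum256(data)
	h := hex.EncodeToString(sum[:])
	s.mu.Lock()
	defer s.mu.Unlock()
	s.chunks[h] = append([]byte(nil), data...)
	return h
}

// Get returns the chunk stored under h, if any.
func (s *memStore) Get(h string) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	data, ok := s.chunks[h]
	return data, ok
}

// Root returns the current value of the register.
func (s *memStore) Root() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.root
}

// UpdateRoot sets the register to next only if it still holds expected.
// A false result tells the caller that someone else got there first.
func (s *memStore) UpdateRoot(expected, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.root != expected {
		return false
	}
	s.root = next
	return true
}

func main() {
	s := newMemStore()
	seen := s.Root() // both writers observe the same (empty) root
	a := s.Put([]byte("writer A's data"))
	b := s.Put([]byte("writer B's data"))
	fmt.Println("A updates root:", s.UpdateRoot(seen, a)) // true
	fmt.Println("B updates root:", s.UpdateRoot(seen, b)) // false: B's view is stale
}
```

The compare-and-swap on the register is the kind of guarantee the rest of the system can build on: a writer with a stale view of the root cannot silently overwrite someone else's update.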

Basically if you think of a stack of components, with your code at the top, and the OS at the bottom, Noms is in the middle. The ChunkStore interface is the boundary between the OS and Noms. It represents the minimum requirements Noms has on the underlying storage system in order to function properly.

Database is the boundary between Noms and you. So whereas ChunkStore's design goal is to be absolutely minimal, Database's design goal is to be high-level, useful, ergonomic, etc.

What is Rebase used for, in both the database and chunkstore contexts? As in, in what sort of situations would one need to do a rebase? I've seen it used in e.g. the memory stores with views, which is understandable (brings the view in line with the "top level" memory store), but I'm not clear on its use in other cases.

Rebase() is useful occasionally when you know that the underlying storage may have changed and you want to bring yourself up to date with it.

Think of a web server that is servicing requests. If this server caches a single database instance across all requests (which would be a reasonable thing to do), then the server would want to rebase the db before each request to ensure it is seeing the latest state of the world.
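A minimal sketch of that pattern, assuming a long-lived handle that remembers the last root it saw. The `backing`, `cachedDB`, `openDB`, `handler`, and `serveOnce` names are hypothetical stand-ins and not Noms APIs; the handler is exercised in-process with `httptest` rather than over a real socket.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// backing stands in for shared storage that other processes may update.
type backing struct {
	mu   sync.Mutex
	root string
}

func (b *backing) set(root string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.root = root
}

func (b *backing) get() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.root
}

// cachedDB is a hypothetical long-lived handle that caches the last root seen.
type cachedDB struct {
	mu   sync.Mutex
	b    *backing
	root string
}

func openDB(b *backing) *cachedDB {
	return &cachedDB{b: b, root: b.get()}
}

// Rebase refreshes the cached root from the backing storage.
func (d *cachedDB) Rebase() {
	latest := d.b.get()
	d.mu.Lock()
	defer d.mu.Unlock()
	d.root = latest
}

// Root returns the cached root without consulting the backing storage.
func (d *cachedDB) Root() string {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.root
}

// handler shares one db across all requests and rebases before serving each.
func handler(db *cachedDB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		db.Rebase()
		fmt.Fprint(w, db.Root())
	}
}

// serveOnce runs a single in-process request and returns the response body.
func serveOnce(db *cachedDB) string {
	rec := httptest.NewRecorder()
	handler(db)(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Body.String()
}

func main() {
	b := &backing{root: "v1"}
	db := openDB(b)
	fmt.Println(serveOnce(db)) // v1
	b.set("v2")                // some other writer moves the root
	fmt.Println(serveOnce(db)) // v2, because the handler rebased first
}
```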

The reason it is called Rebase is that, like git rebase, it brings along its current state. So for example if you do Put a few times, then call Rebase, then call Commit, the chunks that you put will still be visible to Commit. Whereas if you just dropped the db and built a new one, you'd lose that state. I think there is also some cached state that comes along for the ride too, which makes Rebase more efficient than simply building a new database instance, but I can't recall the details.
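The Put, Rebase, Commit sequence can be sketched as follows, assuming a handle that buffers writes until commit. The `shared` and `db` types and their methods are hypothetical, single-threaded simplifications (hashes and chunk contents are plain strings, and there is no locking); they are not the Noms Database or ChunkStore types.

```go
package main

import "fmt"

// shared is the underlying storage: persisted chunks plus the root register.
type shared struct {
	chunks map[string]string
	root   string
}

// db is a hypothetical client handle that buffers writes until Commit.
type db struct {
	s       *shared
	root    string            // the root as this handle last saw it
	pending map[string]string // chunks that were Put but not yet committed
}

func open(s *shared) *db {
	return &db{s: s, root: s.root, pending: map[string]string{}}
}

// Put buffers a chunk locally; nothing reaches shared storage yet.
func (d *db) Put(hash, data string) { d.pending[hash] = data }

// Rebase refreshes the handle's view of the root and keeps pending chunks.
func (d *db) Rebase() { d.root = d.s.root }

// Commit flushes pending chunks and moves the root, but only if this
// handle's view of the root is still current.
func (d *db) Commit(newRoot string) bool {
	if d.s.root != d.root {
		return false // stale view: the caller should Rebase and retry
	}
	for h, c := range d.pending {
		d.s.chunks[h] = c
	}
	d.pending = map[string]string{}
	d.s.root = newRoot
	d.root = newRoot
	return true
}

func main() {
	s := &shared{chunks: map[string]string{}, root: "r0"}
	d := open(s)
	d.Put("c1", "hello")

	s.root = "r1" // another writer moves the root underneath us

	fmt.Println("commit before rebase:", d.Commit("r2")) // false
	d.Rebase()
	fmt.Println("commit after rebase:", d.Commit("r2")) // true
	_, ok := s.chunks["c1"]
	fmt.Println("c1 persisted:", ok) // true
}
```

Opening a fresh handle instead of calling `Rebase` would start with an empty `pending` map, which is the lost state the paragraph above refers to.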

===

HTH, and hope you're still interested.

ORBAT commented 5 years ago

Closing this for now since I've moved on to other projects.