damianiandrea / mongodb-nats-connector

A connector that uses MongoDB's change streams to capture data changes and publishes those changes to NATS JetStream.
MIT License
48 stars 7 forks

Files too? #1

Closed gedw99 closed 1 year ago

gedw99 commented 1 year ago

MongoDB can store files, and NATS can chunk them using the NATS Object Store.

This would allow reacting to file changes, and also replicating them.

damianiandrea commented 1 year ago

@gedw99 Interesting idea, can you elaborate?

For files larger than 16 MB it is better to use GridFS, but I don't know if it has Change Stream support.

For smaller files, however, why would we need to chunk them using the NATS object store?

Thanks for the input!

gedw99 commented 1 year ago

I also don’t know if GridFS supports change streams. I did look at the docs and it seems to support them, but until we try it I can't be certain.

Your point about chunking for small files under 16 MB is true.

The idea of using the NATS Object Store is that you can get the change stream of the file and send it via NATS. The reason is that you can then keep a cluster of MongoDB nodes in sync: the other nodes get the change via the NATS Object Store and update their own instance. I am using Marmot to do this: https://github.com/maxpert/marmot

Marmot only supports SQLite currently, but others and I are breaking it out to be flexible for other databases. Marmot uses NATS to do its thing.

gedw99 commented 1 year ago

Hey @damianiandrea

I got this working.

There are, as you suggested, a few ways to approach this.

There are different file systems I need to watch: local, MinIO (S3), and MongoDB GridFS.

All 3 use the same semantics that you formalised for Create, Update, and Delete, so on NATS it’s consistent. I am in two minds about whether adding an indicator of provenance is worthwhile. It would mean that the NATS payload has a field that indicates which system produced the change.
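To make the provenance idea concrete, here is a minimal sketch of what such a payload could look like. The FileChangeEvent type, its field names, and the source values are all hypothetical; this is not the connector's actual message schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FileChangeEvent is a hypothetical payload shape used only for illustration.
// It is NOT the schema published by mongodb-nats-connector.
type FileChangeEvent struct {
	Operation string `json:"operation"` // "create", "update" or "delete"
	Path      string `json:"path"`      // identifier of the file that changed
	Source    string `json:"source"`    // provenance: "local", "minio" or "gridfs"
}

// encodeEvent serialises an event to the JSON that would be published on NATS.
func encodeEvent(ev FileChangeEvent) (string, error) {
	b, err := json.Marshal(ev)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	payload, err := encodeEvent(FileChangeEvent{
		Operation: "update",
		Path:      "reports/q1.pdf",
		Source:    "gridfs",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(payload)
}
```

With a field like this, a consumer can filter or route on the source without the three watchers needing separate subjects.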

damianiandrea commented 1 year ago

Hi @gedw99,

Thanks a lot for your feedback. As soon as I have some time to spare, I'll look into it!

gedw99 commented 1 year ago

If it’s not needed for your project, just say so :)

I reckon I can adapt what’s here to do it, but I have not really looked deeply into it yet.

damianiandrea commented 1 year ago

Hi @gedw99,

I double checked a few things:

You were asking if files are supported, and they are. You can store files in a MongoDB collection, watch that collection, and its changes will be published on NATS JetStream. The same goes for GridFS: all you need to do is watch the files and chunks collections, and this works out of the box.

What is currently not supported is chunking and storing files on the NATS Object Store, but I don't think it is necessary. If you're dealing with files larger than 16 MB, you should use GridFS and let it handle the chunking and storing. Chunks have a default size of 255 KB and NATS supports messages of up to 64 MB when max_payload is raised from its 1 MB default, so I don't see any reason to use the NATS Object Store. For smaller files, even more so.
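The size argument above can be checked with a little arithmetic. The sketch below assumes the GridFS default chunk size of 255 KB (261120 bytes) and uses hypothetical helper names; it only counts chunks and compares sizes, and does not talk to MongoDB or NATS.

```go
package main

import "fmt"

const (
	// GridFS default chunk size: 255 KB.
	gridFSDefaultChunkSize int64 = 255 * 1024
	// Largest payload NATS can be configured to accept: 64 MB.
	natsMaxPayloadCeiling int64 = 64 * 1024 * 1024
)

// chunkCount returns how many chunks a file of fileSize bytes is split into
// (ceiling division). It returns 0 for non-positive inputs.
func chunkCount(fileSize, chunkSize int64) int64 {
	if fileSize <= 0 || chunkSize <= 0 {
		return 0
	}
	return (fileSize + chunkSize - 1) / chunkSize
}

// fitsInOneMessage reports whether a payload of size bytes is within maxPayload.
func fitsInOneMessage(size, maxPayload int64) bool {
	return size <= maxPayload
}

func main() {
	twentyGB := int64(20) * 1024 * 1024 * 1024
	fmt.Println("chunks for a 20 GB file:", chunkCount(twentyGB, gridFSDefaultChunkSize))
	fmt.Println("one chunk fits in a message:", fitsInOneMessage(gridFSDefaultChunkSize, natsMaxPayloadCeiling))
}
```

Note that the published change event is somewhat larger than the raw chunk, because it also carries the chunk document's other fields and the event metadata, so the comparison is a lower bound rather than an exact message size.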

Afterwards, you mentioned how you would like to watch different file systems, such as a local fs or MinIO. This would be out of scope because this is a MongoDB-NATS connector.

Let me know if you agree with my thought process or if I misunderstood something! I'm always open to suggestions :)

gedw99 commented 1 year ago

Thanks @damianiandrea

This really looks promising.

I currently chunk 20 GB files into and out of NATS and it works well.

I was planning for MongoDB to consume files and data off NATS.

Then MongoDB CDC would emit that the file changed, allowing me to then tell downstream systems that it changed.

So NATS can then be used to keep many MongoDB databases in sync, or to tell downstream workers about changes in MongoDB so they can do whatever they need, typically materialising data with some custom transforms.
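As a rough sketch of the sync idea, a downstream consumer could apply each change event to its own copy of the data. Everything here is an assumption for illustration: the ChangeEvent type and the in-memory map stand in for a real NATS consumer and a real MongoDB instance.

```go
package main

import "fmt"

// ChangeEvent is a hypothetical, simplified change event.
// It is not the connector's actual message format.
type ChangeEvent struct {
	Operation string         // "insert", "update" or "delete"
	ID        string         // document identifier
	Doc       map[string]any // full document after the change (nil for deletes)
}

// apply updates the replica to reflect one event. Inserts and updates are
// treated as upserts, so replaying the same event twice is harmless.
func apply(replica map[string]map[string]any, ev ChangeEvent) error {
	switch ev.Operation {
	case "insert", "update":
		replica[ev.ID] = ev.Doc
	case "delete":
		delete(replica, ev.ID)
	default:
		return fmt.Errorf("unknown operation %q", ev.Operation)
	}
	return nil
}

func main() {
	replica := map[string]map[string]any{}
	events := []ChangeEvent{
		{Operation: "insert", ID: "a", Doc: map[string]any{"name": "file-a"}},
		{Operation: "update", ID: "a", Doc: map[string]any{"name": "file-a-v2"}},
		{Operation: "delete", ID: "a"},
	}
	for _, ev := range events {
		if err := apply(replica, ev); err != nil {
			panic(err)
		}
	}
	fmt.Println("documents left in replica:", len(replica))
}
```

Treating writes as upserts keeps the consumer tolerant of redelivered messages, which matters with at-least-once delivery.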

I am guessing you might be using MongoDB as the first mutation layer and then emitting the CDC events out to NATS, and on to downstream systems?

damianiandrea commented 1 year ago

Hi @gedw99,

Yes, you are correct. However, there are plans to add data sourcing from NATS to MongoDB, so I think that's what you'll need to achieve the synchronization you were talking about! :)

gedw99 commented 1 year ago

Thanks - yep we get each other

Conduit already does all this. It has a NATS JetStream connector:

https://pkg.go.dev/github.com/conduitio-labs/conduit-connector-nats-jetstream

Check it out!

gedw99 commented 10 months ago

Also, for CDC: https://github.com/conduitio-labs/conduit-connector-mongo

Sorry, I forgot to give you this one :)