[Question: devel branch] HTTP access point API beyond GET?

vsivsi commented 10 years ago

First off, kudos to the devs for the new version of CollectionFS, it's looking great!

In the docs I see references to the HTTP access point, and it is easy to verify that this is working (for download via GET) using the {{url}} handlebars helper. So I see something like http://localhost:3000/cfs/files/filesColl/zjAqp9ceEZPeezyWy/filesGrid, which gives the API pattern for an existing file.

Question: How do I use the HTTP access point to upload a new file, or remove/replace an existing one? What I'm looking for is the equivalent to the CouchDB file attachment API here. I see lots of comments suggesting that this functionality already exists, but my initial attempts to ferret out how it works have failed.

I'd be happy with something as simple as an example curl -X PUT -T 'filename' based file upload command. All of my attempts thus far have resulted in a hail of Not Found [404] and Error: APUpload expects binary data messages.

Any help that prevents me from having to reverse engineer how this works would be much appreciated!

aldeed commented 10 years ago

We haven't tested this in awhile, but it should work with a PUT to the same URL, minus the store name (http://localhost:3000/cfs/files/filesColl/zjAqp9ceEZPeezyWy). Obviously this requires that you first insert the file into the collection.

We probably should insert the file into the collection and then store the data when params.id isn't provided. Then you wouldn't have to necessarily do your own insert first.

aldeed commented 10 years ago

The expects binary data message might be because we're checking to see if it's Uint8Array, but it's probably a node Buffer when done through HTTP. I'll see about a fix.

vsivsi commented 10 years ago

Yes, I can verify that if I do:

curl -X PUT -T image.png 'http://localhost:3000/cfs/files/filesColl/qjAqp9ceEZuc56XWy/'

Where that file already exists, I get back:

 Error in method "/cfs/files/:value/:value/:value/", Error: Error: APUpload expects binary data
     at Object.APUpload (packages/cfs-access-point/accessPoint.js:5)
     at Object.<anonymous> (packages/cfs-access-point/accessPoint.js:239)
     at packages/http-methods/http.methods.server.api.js:417

vsivsi commented 10 years ago

We probably should insert the file into the collection and then store the data when params.id isn't provided. Then you wouldn't have to necessarily do your own insert first.

How does any kind of authentication happen in this case? For GET/PUT/DELETE with a bare URL like this, it depends on the obscurity of the file id (which you presumably had to authenticate to create/learn). Is there some way to pass the current user token in the HTTP header?

My use case for this question is a "worker" process on the server side (though perhaps not on the same machine) that is connecting via DDP to get "jobs" that require heavy background processing (similar to what Kue does). It's straightforward using DDP to authenticate a worker and have it use method calls and subscribes to receive and communicate status about tasks. What's missing is a way to pass large files back and forth between the worker and server. Simply supplying HTTP URLs to the input and output files for the worker to GET/PUT would be a great solution.

aldeed commented 10 years ago

Your worker process could also insert the files and then chunk upload using DDP, which is what CFS does by default right now. That said, we're probably going to switch to HTTP for uploads because there are some performance issues with using DDP for binary data.

HTTP uploads is something we've designed it to support, so we just need to fix this issue you're getting. Still investigating.

aldeed commented 10 years ago

OK, @vsivsi, I pushed some support for this. It still needs work, but it's at least working now.

To upload a new file:

$ curl -X PUT -T "My Test Picture.png" -H 'Content-Type: image/png' http://localhost:3000/cfs/files/filesColl/

You need to add the content-type header, but otherwise what you were trying before is correct. Don't include an ID.

If you do include an ID, it should theoretically overwrite that file, but I didn't test that and I'm guessing it's not working correctly yet.

vsivsi commented 10 years ago

Great! Thanks, I'll test it out in the morning.

A followup to my question above regarding authentication over HTTP: I notice that if I'm logged in, the {{url}} helper generates URLs like this:

 http://localhost:3000/cfs/files/filesColl/NvmYH63pTfijuccXa/filesGrid?token=5QRGg8pgk8rRmswfL

With the ?token=XXX parameter. Is this the mechanism for passing the token for all HTTP requests? In testing this for HTTP GET, I'm noticing that it doesn't seem to matter whether I provide the token or not, even though I have deny code like this in effect:

  filesColl.deny({
     download: function (userId, file) { 
         if (file.metadata.owner !== userId) {
             return(true);
         } else {
             return(false);
         }
        // And so on for insert, update, remove
     });

The update, remove and insert deny rules seem to work correctly within the browser. But the download deny never seems to have any effect on the HTTP GET method. The documentation on this is a little thin, so I may be doing it wrong. Thoughts?

vsivsi commented 10 years ago

Too impatient to wait for morning... Your change does work for me for uploading a new file, thanks so much!

I can also verify that the allow/deny rules do not currently seem to be enforced on HTTP requests. The curl -X PUT ... example succeeds even when I have the following rule in place (that successfully prohibits inserts from the browser via DDP):

 filesColl.deny({ insert: function (userId, file) { return(true); }});

I totally understand that this is under development, but it might be worth a prominent warning in the devel branch docs that the HTTP access to CollectionFS (which is enabled by default) is currently wide open.

vsivsi commented 10 years ago

One bit more information: The allow/deny functions don't even get called for HTTP requests.

aldeed commented 10 years ago

I thought auth was working for me. Do you have insecure pkg removed? I'll test again later today.

aldeed commented 10 years ago

I just verified that the allow/deny functions are correctly called for GET and PUT, and when I have deny return true, I get a Forbidden/Access Denied response. I didn't double-check to see if the userId is correctly set, but if you pass the token= query string param, it should be.

raix commented 10 years ago

@aldeed I'm thinking about a way to sign / auth http via the ddp connection - maybe have a way of signing / crypt the user data - The current http.methods implementation does a db call to look up the user to validate - this info could be extracted from the crypted user data - avoiding db access. (this pattern could result in extracting fileworker+http access point and storage adapters into node.js packages - making it a more separated structure - just saying)

vsivsi commented 10 years ago

@aldeed, I am running with the insecure package. Although that doesn't really explain what I'm seeing, because according to the Meteor Docs, insecure should have no effect if any allow/deny rules at all are declared on a collection:

Meteor also has a special "insecure mode" for quickly prototyping new applications. In insecure mode, if you haven't set up any allow or deny rules on a collection, then all users have full write access to the collection. This is the only effect of insecure mode. If you call allow or deny at all on a collection, even Posts.allow({}), then access is checked just like normal on that collection.

I was setting deny rules and they were being respected on my collectionFS when accessed via DDP, so this behavior of: "ignore access rules for HTTP requests when insecure is installed" is unexpected and seemingly incorrect.

tl;dr It shouldn't matter that I am running insecure because I defined allow/deny rules on this collection.

aldeed commented 10 years ago

That's probably our fault for implementing it incorrectly then. Also, I realized that there might actually be an issue because I was testing with a local codebase that was missing some recent changes. I'm looking into it.

vsivsi commented 10 years ago

Okay, I've removed the insecure package and tried again, and now I'm seeing a different set of issues related to the userId values being provided to the allow/deny rules.

Here's the setup: I've defined a set of allow rules for each of insert, remove, update, and download. When logged in to the client, if I set all of the allow rules to return true everything works great.

But if I add a simple ownership test ala file.metadata.owner === userId then all hell breaks loose.

Here's what I see happening in this case: correct userId values are passed to my allow functions for insert and remove, but userId === null for update and userId === undefined for download.

The impact of this is that when I upload a file, the initial collectionFS document is created (the insert succeeds) but the subsequent update to add the chunks and copies attributes fail and the upload fails on the client with an exception Error: "Queue" Access denied [403] [403].

If I relax the update allow rule by setting it to always return true, then uploads succeed. However, downloads via HTTP still fail because the download rule isn't being provided with the valid userId.

Removing the file via DDP works correctly, with the correct userId value provided.

aldeed commented 10 years ago

I think the issue you're seeing now is the same as #166. Something about our code that is "logging in" on secondary DDP connections isn't working anymore. Is this on Meteor 0.7.0.1 or the new RC release?

vsivsi commented 10 years ago

To add to my above comment: the undefined userId being passed to the download rule is coming from an HTTP GET with the token=XXX query parameter correctly set.

I'm running this on the latest commit of the shark branch, which has a bunch of stuff post v0.7.0.1 merged into it. Specifically I'm running https://github.com/meteor/meteor/commit/1d24565fb5ed6a5571cd6c037f1a60a092be8eaa

vsivsi commented 10 years ago

Agreed that https://github.com/CollectionFS/Meteor-CollectionFS/issues/166 probably explains the update issue I reported above. However, I think the HTTP GET download issue is probably distinct since it is getting a different value (undefined vs null) and that case doesn't involve DDP.

aldeed commented 10 years ago

I just pushed some changes (includes slightly different URL syntax, too, so beware). In my testing, userId is coming through fine with HTTP GET and token. If it's still not working for you after taking the latest, then there must be something on shark that's different.

vsivsi commented 10 years ago

@aldeed, May I make one more (hopefully) relatively simple request related to the HTTP PUT functionality you got working for me last night?

As of right now, the PUT works, in that it creates/uploads a file. But what is missing (for my use case at least) is some mechanism for the worker to know which uploaded file is the results of its work (I'd prefer not to rely on the uniqueness of filenames). Right now, once that PUT happens, the worker/client is in the dark about where the file went or even if the upload really worked.

So what I need is either:

For the PUT to somehow return the _id of the newly created file (the reply is currently empty).
For PUT to work correctly in replacing an existing file by _id.

Arguably both of those should work at some point, but I'd be happy with one or the other for now.

aldeed commented 10 years ago

Also, the #166 issue is fixed, or more accurately, "worked around".

aldeed commented 10 years ago

Re PUT, I have the HTTP method returning {_id: file._id}, but doesn't seem to be coming through. There's probably some different way to do it that would work. Maybe I have to stringify it? I'll have to investigate, or @raix probably knows.

vsivsi commented 10 years ago

I just tried everything again using the latest versions of all collectionFS components and there are some improvements. The update allow rule now gets the correct userId, and the upload succeeds. Yay!

I do see an error in the client console Exception while simulating the effect of invoking '/cfs/files/put' and the error Object contains a Access denied [403]. This only occurs in conjunction with an upload.

Once the file is uploaded however, I'm still seeing 403 errors on HTTP GET requests caused by userId === undefined in the allow download rule function. The URL I'm using (provided by {{url}} is now:

 http://localhost:3000/cfs/files/filesColl/filesGrid/universe.png?token=XXXXXX

vsivsi commented 10 years ago

OK, I "desharkified" my test app and tried this out on both Meteor v0.7.0.1 and the new v0.7.1.0 RC2. It works as you describe on v0.7.0.1 and fails identically as it did on shark with the RC2 build.

So whatever changes cause it to fail on shark are about to be released when v0.7.1.0 drops.

vsivsi commented 10 years ago

I've found and fixed the HTTP authentication problem in my local repo. The issue is due to changes in Meteor involving the hashing of active tokens in the user db. A backward compatible fix (works for both 0.7.0.1 and 0.7.1.0 / shark ) involves a single line change in Meteor-http-methods / http.methods.server.api.js.

Use this line instead:

var user = Meteor.users.findOne({ $or : [{'services.resume.loginTokens.token': userToken },
                                         {'services.resume.loginTokens.hashedToken': Accounts._hashLoginToken(userToken) }]});

aldeed commented 10 years ago

Ha, I just made that same change before I saw your latest comment. Wish I had seen it first as it would have saved me some hunting. :) Anyway, the http-methods package is fixed now. Thanks!

@raix, FYI.

vsivsi commented 10 years ago

@aldeed,

tl;dr: I think there is a compelling case to revert to using _id in URLs instead of key, and to strongly consider using _id as the storage adapter key as well...

Okay I've been testing with all this and I've run into an issue regarding the URL format change you recently introduced (now using the file's key instead of the id in the URL)

The issue arises in that filenames in a CollectionFS collection are not guaranteed to be unique; in essence they are just another bit of metadata. In my testing, a CollectionFS is happy to accept any number of identically named files (which is the correct design IMO). The way this is managed internally (given that both GridFS and the OS filesystem require uniquely named files) is that CollectionFS maintains distinct entries for name (non-unique metadata) and key (must be unique, used as GridFS and/or filesystem filename).

The current implementation attempts to use name for key and resolves file key collisions with a simple scheme that adds an incrementing number to the base name of the file (foo.bar, foo1.bar, foo2.bar ...) And herein lies my issue:

If I use HTTP PUT to insert a file, I know the name of the file I added, and in theory I can construct the URL to GET it back. In practice, using this approach, I'm not at all guaranteed to get the same file back, because if I insert foo.bar and a file with that key already exists, then my new file's key will be foo1.bar unless that already exists... And this will be even more problematic in the face of a working implementation of HTTP DELETE (or PUT that can replace an existing file).

Having thought about this for a day or so, I'm pretty convinced that the only sane way to manage this is to return to using the guaranteed unique and unchanging CollectionFS file _id for all HTTP URL operations. In addition, it also seems more sane to just use that _id as the Storage adapter key as well (retaining the extension if desired), and dispense with the "test and renumber" logic that adds complexity but little if any value.

aldeed commented 10 years ago

@vsivsi, I agree with you mostly, but there is at least one gotcha you're not considering, which is that we intend for storage adapters to be able to optionally sync back to the FS.Collection and any other defined stores. That's where some complexity arises because we can't simply assume a file key/name based on ID will be unique. Nevertheless, I think you're on the right track.

These are some requirements as I see them:

Storage adapters (SAs) must be a black box. We know they have certain methods, put, get, etc., but beyond that we should not care what they do.
SA put job is to receive a file, store it, and return a "key" that uniquely identifies that file within the context of that store. Currently we pass in a "recommended key" that is used as the basis, but we of course must alter it to be unique. The only reason we currently pass the filename as the "recommended key" is because that makes it fairly easy to identify files if you're looking at them directly on the SA drive. We could in theory pass anything or nothing as the "recommended key".
The other SA methods accept that same "key" as an identifier and do something to that file (get, delete)
The GET URL needs to uniquely identify a file in a particular store.
The GET URL should ideally end with a file extension for readability, but need not.
The GET URL should be constructable even if we haven't yet stored the uploaded file anywhere.
The PUT URL needs to uniquely identify a collection and provide a file name at the end.
The PUT URL needs to additionally identify a file ID, ONLY IF we are overwriting an existing file.

So then. Why shouldn't we use the store key in the GET URL? Because this fails one of the requirements: "should be constructable even if we haven't yet stored the uploaded file anywhere". Instead, as you suggest, we can revert to using the ID and store name to construct the GET URL.

What about adding the file extension for readability? Well, I don't think we can reliably do that because we don't know the final file extension until the store finishes saving the file (could be changed by beforeSave function). We could use the extension of the uploaded file at first, but that could lead to confusion if it changes and would mess with caching. Thus I'm afraid we won't be able to have an extension on the GET URL.

So I propose these URLs, probably similar to what they were before:

GET: baseUrl + '/:collectionName/:id/:store'
PUT (insert): baseUrl + '/:collectionName/:newfilename'
PUT (update): baseUrl + '/:collectionName/:id/:newfilename'
DEL: baseUrl + '/:collectionName/:id'

PUT (insert) sends back the new _id.

Regarding the "recommended key" and the final key: My current thinking is that we should pass the entire FS.File instance to the SA put method and let each SA do whatever it wants to construct a unique file key. In general, they would probably do file._id + "-" + file.name and then alter that to be unique if necessary before saving it.

@raix, let me know if you disagree with any of this. This discussion seems to indicate that your idea about using the store key as the URL won't work. If I'm misunderstanding your idea, let me know.

vsivsi commented 10 years ago

@aldeed, Thanks for the detailed response. I'll probably have more questions/comments about the above after some time to think and peer into the code a bit more, but right off the bat one of your stated requirements puzzles me:

The GET URL needs to uniquely identify a file _in a particular store_.

Why is it necessary for the HTTP client to even be aware that there are stores? There are two related reasons I think this may be undesirable.

1) Philosophical: Leaky abstraction. You've all gone to great lengths to build up a consistent "Collection" abstraction here, only to allow a bit of "how it's stored" detail to leak into the picture for the client.

2) Practical: If I PUT a file: baseUrl + /:collectionName/:newfilename and then want to use/communicate that file's GETable URL, how do I know what URL to construct? Clearly the PUT needs to return the id, but in the proposed scheme, it also needs to return a valid store, which seems to violate:

The GET URL should be constructable even if we haven't yet stored the uploaded file anywhere.

Beyond the philosophical objection to this above, it seems to me that store need not be unique and/or static. This raises the prospect of GET URLs that become obsolete simply because some internal storage detail changed. It also implies that the client should have some say into precisely how a GET request is satisfied by the server (IMO, a detail better left to the parts of the system best informed to make that decision)

In short, it seems that the store isn't strictly necessary to satisfy a GET request, and requiring it leads to a bunch of potential problems.

raix commented 10 years ago

@vsivsi: Client knowledge of stores The client need to know what store you want to download - you could have a store that contains "thumbnails" or "original" etc. We did use the naming "copies" prior but its really a concept of having multiple stores attached - that said one could reuse storages adapters (SA) eg. have multiple stores use same SA.

Transfer http only The current client needs to know what store to use for initial upload - We want to allow the client to upload directly to S3 - then have it trigger a sync of the multiple stores. This pattern could change - I'm currently preparing to replace http upload (instead of using the current ddp) The reason for this is:

MDG has confirmed that the current ddp is not suited for file upload (we have encountered alot of issues - why we are delayed in our work - trying to solve it)
We can actually make the client-side code lighter, specially if we adapt the pattern of s3 chunked http upload - But we have to write the stuff from ground up - the HTTP.call/put/get.. is not and will not be improved to support file upload functions as abort and progress (again @meteor decision)

In reality we could simply have the http as default upload / transfer - the server would know how to handle an upload to a FS.Collection. It would also make it a bit easier to write / use libraries from ios/android etc. since http is widely adapted. So that said the client only needs to know what server and what collection to target.

We will prop. be implementing the s3 way of signing - it works well with Meteor. I've already created a working implementation - not released (to other than @aldeed)

Naming key/urls pattern Regarding the naming convention - we are working on a pattern - its going to be understandable - you could simplify it but it would come at a cost. as @aldeed correctly points out one of the patterns that I recently suggested has an implication by the extension. We are trying to abstract and isolate the code modules / packages into small black boxes. It will make it easier to extend and debug. Having 3 SA's already is really nice when debugging - its a great way to isolate problems and try to solve them.

The url pattern and key needs to consist of the following:

collectionName - we need a target container / bucket
id - the file id - we need a reference to the actual data or chunks of data / metadata etc.
store - this is optional - if empty we serve the "master" / primary / first SA in the collection So thats the basics, right?

We also have the filename as the last one - this is actually causing problems with our architecture (trying to separate code in reusable packages) Now why bother with it? - one could say that from the user perspective its human readable - thats important too, right? Specially when/if the user wants to download the file locally

Future ideas @aldeed What I've been thinking about lately is making the use of filename optional - The SA will not have anything to say about naming conventions - It will be told the key - how it handles the actual storage is SA specific. Now the beforeSave on the other hand will have something to say when it comes to the extension - but thats about it - we can grap this on the fly?

collection/id // this is the minimum for downloading a file
collection/id/store // we select a specific store
collection/id/store/name.ext // this is the full key when the fileWorker adds a file to urls / stores
collection/id/name.ext // this is invalid - will never be used since urls contains the store too

the keys are actually the urls - urls will have to be prefixed by the default /fs/files or custom user defined mount point

weeyoung commented 10 years ago

Hi, Great work so far for CollectionFS. i agree that that the GET URL needs to uniquely identify a file in a particular store. It does makes sense and implement the naming pattern proposed by @raix. I just started using CollectionFS and in particularly cfs-s3, hence may not understand this fully. I do have some points (some 5 cents worth).

I guess, to me, the most common situation, is that different user may have the same file, Thus, we need to uniquely identify each file. To a single user, they would normally have differentiate their own files. I would feel one possible use case is to use userid as the id (configurable). e.g. collection/userid/store/name.ext id to be the userid (if userid is present). This way, it could be easier/efficient to grab/ list collections by the particular user.
I was pondering about the thumbnails. I see that the current way of implementation is that to use Meteor-cfs-graphicsmagick and create a copy as a form of a thumbnail. It would be great to be able to link the thumbnail/actual photo(e.g. both are in the same photo)
The thumbnail would be created on saving of the photo. Though I think that the thumbnail would be viewed almost instantaneously. Would it be better to have save the thumbnail upon first retrieval of the photo? It could be more efficient that way? Or maybe have a GET with the parameters.
In a particular use case, what happens when I decide to change the size of the thumbnail? That would be a tough problem to try to change the photos size. It would be better to have a thumb photo to have a fixed name in the folder, and this may allow us to override the photo if the md5/metadata is different?

raix commented 10 years ago

We solved these things in following pattern:

User id goes into metadata - you can edit allow/deny rules for security and use collection.find() in your publish - only publishing the files where userId is in the metadata
Thumbnails and original etc. are linked - the FS.Collection will contain urls (these are keys that link the files)
We have recently changed the api so that we specify an order in witch the fileWorker should process the stores in the collection - So you can have the thumbnails be saved/generated and published first.
If you want to trigger a rerun of all the thumbnails there are two solutions:
1. remove the store from each file urls object
2. add a new store with a different name and remove the old one ('size40x40' replaced with 'size50x50')

both will result in a missing store in urls and the fileworker should recover - 1. is the mechanism also used when synchronizing - the SA will simply overwrite the file physically

We have actually also added a simple way for you guys to easily create public shared urls that could be sent to an user without login - and again easily revoke the access.

I understand that creating a thumbnail is important but we are also going to provide a scheme making the image available in url even before uploaded - this is our latency compensation - again working in our private prototype.

aldeed commented 10 years ago

@raix:

The SA will not have anything to say about naming conventions - It will be told the key - how it handles the actual storage is SA specific.

This is something I've tried, but it's not practical for at least three reasons:

The SA does not necessarily have a way of remembering which key was provided to it for each file. For example, if the filesystem SA gets a file with a key, it will convert that key into a unique file name, but where does it then link the unique filename with the provided key? Maybe in file metadata, but that would get slow for lookups. Before we could have used (and did use) the SA .files collection to do this mapping, but I removed it.
Providing the key would cause problems with sync, too. Thinking of the packages as black boxes, the SA would not know what syntax is expected when passing a key to the collection for a file stored directly into the store and sync'd back.
If you tell the SA what key to use and it includes an extension, the SA beforeSave function might change the file format and extension, and therefore the key's extension could be misleading and definitely could not be used in the GET URL.

I'm really thinking that the most "black box"ish way to do it is to pass the FS.File instance to the SA and let it use whatever info it wants to construct and return whatever type of key works best. Specifically:

Filesystem SA will do file._id + "-" + file.name and potentially append numbers to make it unique.
GridFS SA will save into the .files and .chunks collections and then return the .files _id as the unique key.
S3 SA might save to S3 bucket with file._id + '/' + UUID + file.name as the S3 key, doing a GetObject first to make sure that key doesn't exist yet.

The only downside of this approach is that we cannot possibly have an accurate file extension on the URL. Or to do so, we would need to have the developer define a savedExtension: 'png' option on each store that has a beforeSave.

@vsivsi, regarding the client knowing about stores, @raix explained it pretty well, but to put it another way: After uploading a file, it no longer exists as a single file. Instead we store copies of it in one or more places (stores) and we potentially manipulate/alter those copies. The original file may not even be stored. For images, you might store thumbnail, small, med, large copies and then toss the original in case it was giant. For audio, you might store a 30 sec mp3, full length mp3, and full length wav, regardless of what the exact uploaded content type was. The point is that knowing the file info for the original uploaded file tells us nothing (necessarily) about what files are available for download.

raix commented 10 years ago

@aldeed true, but:

The beforeSave knows about extensions also if changed/converted - so it would save with correct extension and content.
Filesystem could save to '/collectionName/id/store/filename.ext' - making the uniqe numbering obsolete
S3 would save as it does already/almost 'collectionName/id/store/filename.ext' (like cloudFS)
The gridFS would have several options - but it could be filename that equals 'collectionName/id/store/filename.ext' etc.

The beforeSave would know about extension due to the api eg.:

this.gm().resize(50, 50).save('jpg');
// Or
this.setExtension('wav');

I think having SA's as "dumb" slaves would simplify the logic - they need a filename and some data. True we could send a file object but the SA doesnt make use of the logic. The fileworker should carry out the execution of beforeSave.

Regarding the sync - its something the fileworker should solve - I think if we could have it use the urls and have it be intelligent about it we could have a simple and powerful pattern for triggering sync after upload, and a nice way of rerunning or synchronizing in general.

aldeed commented 10 years ago

@raix, I'm hoping you're going to convince me soon, but so far not yet. :)

The beforeSave knows about extensions also if changed/converted - so it would save with correct extension and content.

The issue is not about saving with the correct extension. It's about the key being correct. If I'm understanding you correctly, here's an example of the problem:

CFS gets a new file with name bunny.png. Knowing the collection, id, and store name, we pre-define the file's unique key as "/images/abcd1234/thumbnails/bunny.png"
CFS passes this pre-defined key and some data to thumbnailsStore.put (which is "dumb", as you say). This store is a filesystem store and has a beforeSave function that shrinks and converts to JPEG. The result is a file with name "bunny.jpg". So we now save as "/images/abcd1234/thumbnails/bunny.jpg".
Now CFS passes in the original key ("/images/abcd1234/thumbnails/bunny.png") to retrieve this file, but we can't find it because it's saved as "/images/abcd1234/thumbnails/bunny.jpg" in the filesystem folder.

And regarding sync, I don't see how the fileworker can solve the issue of name conflicts. Here's an example of that:

We save a file directly into "~/uploads/images/abcd1234/thumbnails/bunny.png" and it syncs back to our CFS collection.
Now CFS gets a new file and it happens to be named "bunny.png" and have the _id "abcd1234" (unlikely, but we have to account for it). Like above, we pass a pre-defined key but there is no beforeSave, so we just save it into the filesystem SA folder. The key CFS creates using the convention is "/images/abcd1234/thumbnails/bunny.png" and we save it in "~/uploads/images/abcd1234/thumbnails/bunny.png". But oops, there was already a file there that was saved externally, so we've just overwritten it.

The only way around this issue is to search the entire collection and look at the existing key for this store for every file and then alter our key if it isn't unique. That is not efficient, so it's much better to let each SA worry about ensuring uniqueness at the time of saving.

raix commented 10 years ago

hehe, gotta do a better job then :)

fileworker gets a file id + collectionName + bunny.png from the transfer mechanism
the fileworker execute the store.beforeSave and gets some data and lets the beforeSave change extension
the fileworker then calls store.adapter.put with the the data and 'collectionName/id/storeName/bunny.png'
then on success updates the fileRecord by adding store to urls eg. $set: {'urls.mystore': 'collectionName/id/storeName/bunny.png'
when trying to retrive the url for mystore we do file.urls.mystore and get 'collectionName/id/storeName/bunny.png' - we prefix with mount point '/fs/files' or custom

Name / key conflicts so - we allow rewrites - there will be no odd cases?

aldeed commented 10 years ago

OK, so you suggest we move the calling of beforeSave into the fileworker, which is fine, and that solves one issue.

This plan still doesn't allow us to know the URL before we save, as in my previously stated goal:

The GET URL should be constructable even if we haven't yet stored the uploaded file anywhere.

So I guess the question is: Is that really an important feature? The reason for wanting to be able to construct the URL before it's saved is to be able to do something like this:

MyImages.insert(file, function (err, fsFile) {
  console.log("Thumbnail file available at", fsFile.url({store: 'thumbnail'}));
});

If we can't know the URL until it's saved, then this would usually (race condition) print "Thumbnail file available at null". But if URLs are constructable without first saving (i.e., if they don't include an extension), then the url method will never return null so we can immediately know what the URL will be.

raix commented 10 years ago

True the url will not be present in urls before the fileworker stored the file to that SA - in meantime - what we could allow is to return the dataUrl if no store found. The cloudFS does this - displaying the image when upload has begun - its handled reactively so when the urls contain the real image that will be loaded instead and the memory released (via the upload queue). (This is only relevant for images and files of a certain maximum size)

We could also, if viewing remotely display a spinner while the upload is progressing or a progress indicator.

Btw. The url is constructable since the filename is just there for sights we could query for the real key: 'collectionName/id/store' - But: it does not matter since the file cannot be retrieved since the SA havent completed.

aldeed commented 10 years ago

The url is constructable since the filename is just there for sights

If we use a dummy filename or no filename at first, that affects caching.

Maybe we just give up on the "always constructable" requirement. One will have to use deps or observes to wait until hasCopy() returns true and then call url() at that time.

raix commented 10 years ago

Well - its constructable - but we can only be sure that we can load the file when the url is in urls - Its true about caching - except using urls without a filename could be resolved and redirected on the server (kind of the signed url scheme) - we should be consistent about urls.

But what do we expect to see when uploading a file - well I like the idea about showing the image from local while uploading and then switching when ready.

If file not found eg. a remote user is uploading it would be ok to show a loading gif indicator? (this could be done by a temporary redirect on the server)

vsivsi commented 10 years ago

Oh my, what have I done here... :-) (jk, this discussion has obviously been ripe to happen for a while.)

It is extremely tempting for me to jump into this conversation with a bunch of concrete proposals of my own based on hard earned experience. I have a long history of working on precisely the types of use cases you are hoping to address. But this is your project and you've clearly put much more thought into what you'd like to build here than I possibly can. Also, this is not what I do anymore, and I need to stay focused.

So in lieu of making a detailed set of counter-proposals or requirements, please indulge me in considering a few general points:

CollectionFS: Meteor webbased filesystem handling up and downloads.

IMO, to really succeed, CollectionFS needs to do this critical and non-trivial thing well. And thus far the core CollectionFS and FileFS constructs go a long way toward achieving that. Congratulations.

However, it is has become clear in reading the dozen or so comments above this one that the vision for CollectionFS is expanding considerably to encompass more of what I think of as "Content Management System" (CMS) functionality. In particular, the planning process seems dominated by use cases involving photo sharing (e.g. automatically generating thumbnails and other alternately resized images, etc.) This is an important use case -- perhaps by volume of files currently the most important -- but it is only the tiniest slice of the possible interesting uses of what you are building here. So my general concern is that you may succeed in building a very good piece of photo (and perhaps audio/video) sharing infrastructure, at the cost of building the best possible general file store for others to use in ways you cannot yet imagine. I understand that this may not be your goal, but I argue that even for your own sake it probably should be.

And so my advice is to focus feverishly on creating a simple, efficient and general purpose file store; and then let a thousand flowers bloom, cleanly abstracted on top of that. One class of such things will certainly be CMS packages, and you yourselves may write the very best one, but work hard to make that a separate thing built on top of the general abstractions of CollectionFS. When thinking about the users of the core CollectionFS, try to imagine all of the different applications involved in this list. Your photo sharing use case touches a tiny fraction of what is out there to be done, even if it seems like the most relevant and pressing immediate concern.

So a few concrete observations/suggestions for thought:

Be fanatical about maintaining the integrity of your architectural abstractions. For example, if there's a way to avoid clients needing to know anything about the physical backing stores, please find it.
Keep complexity at bay. I think I see suggestions above that a given file _id might be associated with different files simultaneously. If true, that will lead to insanity. In a true filesystem the thing with an id must be a file: a single, specific blob of bits that may exist in multiple places (some of which may be stale/invalid), but which has only one valid accepted retrievable state at any given instant in time.
Clearly decide what the atomic unit of CollectionFS should be and associate unique ids with that. If you choose what I will call an "asset" or piece of content (e.g. all of the current resized/format converted versions of an image), then you are not building a file system, you are building a CMS. Analogously, if you add the requirement that all current and prior versions of a given file must be retained, you would be building a revision control system, and not a file system, and your ided atomic unit (in a system like git) is what we know of as a "commit".
Synchronization is very hard to do correctly in a system that can become arbitrarily complex. And a system that is handling user data at scale across an unconstrained set of backing stores with wildly different characteristics will become arbitrarily complex, and it must be correct. My advice is do not underestimate how hard this will be to get right; getting it wrong could easily sink all of your efforts. Again, think hard about how you can constrain complexity and abstract it away from the core functionality.
You've had the foresight to include general metadata in your file abstraction, and have correctly determined that this is the best way to handle seemingly central constructs (e.g. owner). My advice is to put everything you possibly can in there and strip the functionality of your core CollectionFS/FileFS constructs to the bare minimum. IMO, there should be nothing specifically built into CollectionFS for use cases such as connecting an uploaded image to its thumbnail. That can be easily accomplished by a CMS layer built on top of CollectionFS that knows how to correctly respond to: http://localhost:3000/cms/images/universe.png http://localhost:3000/cms/images/universe.png?thumb=true http://localhost:3000/cms/images/universe.png?original=true http://localhost:3000/cms/images/universe.png?resize=1200x800 ...

I've written more than enough, and I hope I don't seem preachy or come across like you don't already know much of this, because you are clearly smart, skilled and motivated enough to get this right. But I saw enough red flags in the above (now much too long) thread that I felt duty bound to relay some of the dangers I see as an experienced outsider. I sincerely hope you find it constructive and useful.

aldeed commented 10 years ago

@vsivsi, generally speaking, I think we're on the same page. We're certainly not planning any versioning (at least I'm not), and the "CMS" aspect is limited to the fact that we're allowing multiple "copies" of a file to be created and stored. (The terms "copy" and "store" are somewhat merged now. Multiple stores can be used to save the same copy in multiple places, different copies in the same place, or some combination.)

The goal is for it to work without much configuration out of the box but support additional things you might want to do.

In the simplest case where you have just one store/copy, a client will be able to request the file back without knowing anything about the stores (i.e., I store a file, I'm told its new ID, I use that ID to retrieve the file back). If a client does happen to know that other copies of the file have been stored, it can request a particular copy.

Currently the anatomical unit is the uploaded file. The id corresponds to this concept. If this file has actually been saved in various formats in various stores, then that information is found in the copies metadata of the uploaded file's record. You say that this means we are building a CMS, but I can't really envision a good file storage package that does not allow you to manipulate and potentially duplicate a file before storing it. Once you allow that, there has to be some way then to tie the various copies together. If you have concrete suggestions in this area, I'd be happy to hear them.

As for sync, that probably won't be implemented immediately, but we're trying to make sure we don't do anything architecturally that would preclude it.

raix commented 10 years ago

@vsivsi no worries :) we have been discussing a lot of stuff the past 6 months in and out of issues - it seems to be a good way of bending each others minds, we do try to find the best pattern or maybe a whole new way of considering files. Its a balance because our main goal is: we want this to be modular/reusable/flexible, fast, light and easy to use.

Multiple filehandlers/copies or stores pr. file is one of the parts that we often need when speaking apps - we have a lot of focus on images since its a very common usage but we also allow stuff like conversion from one sound/video format to another or text to speech etc. its very flexible.

That said you can actually leave out the cfs-fileworker package, the beforeSave and multiple stores support should go away. (when we are done refactoring) You would have the simple use:

var images = new FS.Collection('images', {
  store: new FS.Store.FileSystem('files', '/myfiles')
});
url = images.findOne().url(); // returns the first store if nothing is specified

With the flexibility to easily select your storage adapter

Sync is not going to be a feature in the first release, but as @aldeed said we should have it in the architecture. Sync is now days a fs feature, we could have a dropbox or google drive storage adapter - even the filesystem sa should be able to start a sync. In the current architecture sync isnt that hard - but it requires that we keep track of a universal timestamp for conflict resolution and have a way for the sa to trigger a sync operation. Thats a package to come :)

I guess FS.File and FS.Collection is the core - Everything else is replaceable.

@aldeed I tend to agree with you on the point where we throw an fileObj at a sa and have the sa figure out the reference. Ideas:

The beforeSave is a fileworker thing we should find a way to refactor that into the cfs-fileworker.
Urls are part of the http distribution package - maybe have this part go into the http-accesspoint (have the package extend the FS.File api?).
cfs-http-transfer would provide the upload pattern and api (again extending the existing FS.File/FS.Collection api - maybe have a way to allow multiple ways of transfering files - FS.Transfer.HTTP - if no transfer method assigned to collectionFS then grap first in FS.Transfer?)
Maybe refactor some of the stuff into a general cfs-upload-manager that takes chunk of data and when done contacts the relevant sa.

@vsivsi The client needs not worry about stores in the constructor in the future - I think we are going back to the old pattern here. Its mainly due to the recently shift of upload pattern to http.

  // client
  var images = new FS.Collection('images');

We plan to release before April - mainly need to finish refactoring and write tests.

aldeed commented 10 years ago

@raix, re your bullet list, you must not have looked at the new issues I created yet, but you'll be happy to know that they happen to be calling for exactly the same changes as you suggest in your list. :)

raix commented 10 years ago

Cool - two heads one mind :) I'll take a closer look

aldeed commented 10 years ago

I believe other issues have been opened for anything unresolved in this one.

yozawiratama commented 10 years ago

i wish this can be help :)

PostImages.allow({
insert: function() { return true },
update: function() { return true },
remove: function() { return false },
download: function() { return true },
});

i have that problem too, and i just allow for download: function() { return true }

Meteor-Community-Packages / Meteor-CollectionFS

[Question: devel branch] HTTP access point API beyond GET? #168