Closed. kb-0311 closed this issue 1 year ago.
pls assign it to me
The issue I am currently working on can only be merged after these changes are merged, so that PR will need to wait until this one is done.
I will take around 3-4 days to complete the work for this particular refactoring issue. Until then I will complete one more small testing (non-utilities) issue.
@palisadoes please check out the comment under the now closed PR https://github.com/PalisadoesFoundation/talawa-api/pull/956#issuecomment-1410244587
@palisadoes - Update: I have been learning about how to achieve plugins in the API and admin console, and what I have come up with is:
1. A src/CloudProviders/Amazon-s3 directory, where I would create the subdirectories and files for the complete integration of S3 in the app, with #764 as reference. This directory structure will allow us to develop features that integrate multiple cloud storage providers as plugins in the future.
2. A Plugins collection entry for the same, to detect the existence of this plugin on load. If and only if the plugin exists for the organization will the exported class from point 1 be loaded as functionality for the organization whose admin added the plugin, and the image storage bucket will then be S3 instead of the current filesystem.
I would love to work on this issue, but I have a humble request: passing the environment variables will need an encryption layer between talawa-admin and talawa-api for security reasons,
which is a part of this idea - https://docs.talawa.io/docs/internships/internship-ideas/#api-improved-backend-performance-and-security
I will be applying for GSoC this year under that idea, as I had mentioned here - https://thepalisadoes-dyb6419.slack.com/archives/CSWGQ70UQ/p1672734315213179 - and since the S3 integration will somewhat require the progression of that idea,
could you please add the integration proposal under that idea and make it a 350 hr project? Then the GSoC period will have a good time distribution once this cloud-provider integration sub-task is added. To be completely honest, not to throw a pity party or a sob story for myself, but due to personal financial reasons I do need the stipend that comes along with a 350 hr GSoC project. I have been contributing PRs for test coverage and bug fixes so far, have done the extra tasks whenever asked, and have kept the promises I made. I would also like to work on this sub-idea and complete it.
For now I will be writing the classes for the integration mentioned in point 1, roughly along the lines sketched below.
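Purely as an illustration of the plugin idea (none of these names exist in talawa-api yet, they are hypothetical), such a class could look roughly like this:

```ts
// Hypothetical shape of a cloud-storage plugin; all names here are illustrative only.
export interface CloudStorageProvider {
  // Uploads the file and returns the URL (or identifier) to store on the image field.
  uploadFile(fileName: string, data: Buffer): Promise<string>;
  // Removes a previously uploaded file.
  deleteFile(fileName: string): Promise<void>;
}

// An S3-backed implementation would live under src/CloudProviders/Amazon-s3 and would
// only be loaded when the organization's admin has installed the corresponding plugin.
export class AmazonS3Provider implements CloudStorageProvider {
  constructor(
    private readonly bucket: string,
    private readonly region: string
  ) {}

  async uploadFile(fileName: string, data: Buffer): Promise<string> {
    // The actual AWS SDK call is omitted here; see the S3 sketch further down the thread.
    return `https://${this.bucket}.s3.${this.region}.amazonaws.com/${fileName}`;
  }

  async deleteFile(fileName: string): Promise<void> {
    // The actual AWS SDK delete call is omitted here.
  }
}
```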
@palisadoes This feature could also be useful when we give org admins the option to store their binary data with their specific cloud providers. In other words, this approach could easily open up the option of supporting multiple cloud storage providers.
What do you think about the proposal in the above comment?
@palisadoes @kb-0311 You would be introducing needless complexity by integrating S3 into the project if all you need is a way to store small images. MongoDB offers an inbuilt storage solution known as GridFS. It would be easier to use just this. S3 comes with additional complexities which would be a pain to set up for new contributors, and it's not a free service.
Though if you want better latency, high throughput, lower storage charges etc., or the files you're storing are too large, like hundreds of megabytes or even bigger, then S3 is the viable choice.
@xoldyckk Thanks for your feedback.
So just to summarize: I think your suggestion is great, you could be onto something here. @palisadoes @xoldyckk should I start working on replacing the file system storage with GridFS this week?
@kb-0311 lol I didn't read your proposal, sorry. The whole idea of using plugins for integrating different storage providers seems really solid and future-proof. For storing static files on the local file system, a native inbuilt solution like GridFS would be good.
Btw the whole round trip of retrieving the files from s3 then sending them to clients is wasteful. It would be better for the clients to directly request the resources. The api just needs to send back the url to retrieve the resource from s3 bucket or gridfs. Though idk what the complications of that approach are. Just wanted to add my opinion.
Idk when you should start working on this. Only @palisadoes can tell you that.
@xoldyckk
> The whole idea of using plugins for integrating different storage providers seems really solid and future-proof. For storing static files on the local file system, a native inbuilt solution like GridFS would be good.
Thanks a lot for your compliment, means a lot.
> Btw the whole round trip of retrieving the files from s3 then sending them to clients is wasteful. It would be better for the clients to directly request the resources. The api just needs to send back the URL to retrieve the resource from s3 bucket or gridfs. Though idk what the complications of that approach are. Just wanted to add my opinion.
Yes, and that's why retrieving files from the S3 bucket was never my idea. My idea was: when an admin's virtually hosted S3 instance is configured in the installed plugin, an uploaded file will first be stored in the S3 bucket, which is deployed as a website serving files. The image is then available as a virtual-hosted URL, e.g. when an image puppy.png is uploaded ->
https://my-website.s3.us-west-2.amazonaws.com/puppy.png
So basically we store this URL as a string in the image field of USER/ORGANIZATION etc. On the client side, while rendering, this URL is fetched from the CDN and the image is rendered directly instead of doing file retrievals.
(Of course we will need to hash the image names first.)
Of course the migration from the filesystem to GridFS would also be very good.
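For illustration, a minimal sketch of that upload flow with the AWS SDK v3; the bucket and region are just the ones from the example URL above, and the helper name is hypothetical:

```ts
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const region = "us-west-2";
const bucket = "my-website"; // example bucket from the URL above
const s3 = new S3Client({ region });

// Upload an (already hash-named) image and return the virtual-hosted URL,
// which is the string we would store on the USER/ORGANIZATION image field.
async function uploadImageAndGetUrl(key: string, body: Buffer): Promise<string> {
  await s3.send(
    new PutObjectCommand({
      Bucket: bucket,
      Key: key,
      Body: body,
      ContentType: "image/png",
    })
  );
  return `https://${bucket}.s3.${region}.amazonaws.com/${key}`;
}
```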
- The uploaded content should only be made available to people of the same organization. How would you ensure this would happen with cloud providers? It's an area that I'm not that familiar with.
I have a solution for this, but it is quite complicated and not worth exploring right now. Let's hold on to the whole S3 idea for a minute and allow me to discuss the other point: file storage on the API server's file system.
- Also we have to remember that most non-profit organizations may not have the funds to pay for cloud hosting, so storage on the API server will need to be the default plugin option.
It seems that we are using Heroku for our backend deployment. Why storing images in the file system of the API server, in this case a Heroku dyno, is a bad idea:
1. The deployment has an ephemeral disk, which means all files written to its file system get deleted once every 24 hours during cycling, or every time a new deployment is published. Source - https://devcenter.heroku.com/articles/active-storage-on-heroku#ephemeral-disk
2. Even if we find a workaround for that, there is the issue of the size of code that can be deployed in production. For Heroku the slug size limit is only 500MB; for image files across multiple orgs, group chats, direct chats, users etc. this is really low. Once the slug size exceeds 500MB it throws an R15 error. How are we planning to scale in this situation? Source - https://devcenter.heroku.com/articles/slug-compiler#slug-size
3. Even if we ditch Heroku and move to some other type of deployment, and let's say the above two problems are solved, then in the event a deployment fails and we need to redeploy the code, all the images will be lost. How are we planning to back this up in this case?
4. There is of course the event of some idiot pushing code into the main branch that could delete these files' directory. Source - https://github.com/PalisadoesFoundation/talawa-api/pull/948 (lol). The files would be deleted without any backup.
I strongly propose implementing @xoldyckk's suggestion for now, i.e. using MongoDB's inbuilt GridFS and storing the images in a separate database; this will make backups, horizontal scaling and separation of concerns much easier. This could be the default option instead of using the filesystem.
As @xoldyckk said -
> MongoDB offers an inbuilt storage solution known as GridFS. It would be easier to use just this. S3 comes with additional complexities which would be a pain to set up for new contributors, and it's not a free service. Though if you want better latency, high throughput, lower storage charges etc., or the files you're storing are too large, like hundreds of megabytes or even bigger, then S3 is the viable choice.
Please provide some more detail on the MongoDB based proposal. For example:
- What new settings will be required?
- What minimum system requirements will be needed?
- Etc.
Sure, Sir.
1. The GridFS images will be stored in the same database as the other metadata. GridFS will create two collections - fs.chunks (for storing file data) and fs.files (for storing image metadata like file name, date etc.). We can access the stored files by file name, which will be hashed when an image is uploaded. All the logic for the reupload duplicate check will remain relatively unchanged.
2. I went ahead and created a sample code snippet for mongoose (which is a work in progress) - https://github.com/kb-0311/talawa-possible-gridfsSoln. Basically, after a file has been uploaded successfully, we can retrieve it by its file name from the bucket, then write its data temporarily to the file system. Think of it as a backend cache for image files. So before retrieval, we will check whether an image file with the same file name exists in the ephemeral file system; if yes, we send back the file, and if not, we make a round trip to the database, fetch the image and write it into the file system (a rough sketch of this is included right after this list).
3. This way we get the speed of the file system along with the consistency and scalability of the database, and even if the ephemeral disk is cleared, we still have all the image data stored in the db.
4. Before a file is uploaded we will check that the images/ dir, which will serve as our file system cache, is within a size of 300MB; if it surpasses that, then the image dir will be deleted.
5. As GridFS is a MongoDB solution, the system requirements will be the same as MongoDB's. As for changes to the project settings, right now I don't think we will require any, since mongoose itself exposes GridFSBucket under its mongo object.
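A rough sketch of points 1 and 2, assuming the GridFSBucket that mongoose re-exports from the MongoDB driver (the function names and the images/ cache path are illustrative, not the actual snippet from the repo linked above):

```ts
import fs from "fs";
import path from "path";
import mongoose from "mongoose";

// GridFS keeps its data in the fs.chunks / fs.files collections of the same database.
// Assumes mongoose.connect() has already been called elsewhere.
const bucket = new mongoose.mongo.GridFSBucket(mongoose.connection.db, {
  bucketName: "fs",
});

// Upload: stream the (hash-named) image file into GridFS.
export function uploadImage(localPath: string, hashedName: string): Promise<void> {
  return new Promise((resolve, reject) => {
    fs.createReadStream(localPath)
      .pipe(bucket.openUploadStream(hashedName))
      .on("finish", () => resolve())
      .on("error", reject);
  });
}

// Retrieve: serve from the images/ cache if present; otherwise fetch from GridFS
// and write a copy into the ephemeral file system cache for next time.
export async function getImagePath(hashedName: string): Promise<string> {
  const cachePath = path.join("images", hashedName);
  if (fs.existsSync(cachePath)) return cachePath;

  await new Promise<void>((resolve, reject) => {
    bucket
      .openDownloadStreamByName(hashedName)
      .pipe(fs.createWriteStream(cachePath))
      .on("finish", () => resolve())
      .on("error", reject);
  });
  return cachePath;
}
```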
So, what do you think?
@xoldyckk btw do you have any suggestions regarding this, or anything left uncovered above?
@kb-0311 This is assuming talawa-api is deployed on heroku right?
We need a solution that will work first on a local installation of Mongo, then on a hosted version. Our documentation will need to be updated to reflect the different scenarios.
@palisadoes The whole point of using GridFS is to have a remote backup of static files, because Heroku has an ephemeral file system. It doesn't provide a permanent storage solution like S3. So, once the dyno is restarted, all the static files uploaded by talawa users are wiped. There's also the issue of how many files you can store on a native file system.
GridFS does work with a local installation of mongo. Correct me if I'm wrong @kb-0311. So, no problem for contributors.
Didn't understand point 4. You're going to cache static files on the file system and when it's full you'll do a clean wipe and start caching from 0? Really weird way of doing it I'd say.
So it sounds like Heroku is best for temporary caching of content objects.
@xoldyckk yes, GridFS also works with local instances of mongo, so no extra hassle for contributors.
Regarding my point 4, it is necessary because we have to make sure that the dynamic size of the API server slug does not exceed 500MB, because of point 2 here - https://github.com/PalisadoesFoundation/talawa-api/issues/952#issuecomment-1413395447
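As a sketch, that check before caching a new file could look like this (the 300MB figure and images/ path are taken from point 4 above; the function name is hypothetical):

```ts
import fs from "fs";
import path from "path";

const CACHE_DIR = "images";
const MAX_CACHE_BYTES = 300 * 1024 * 1024; // 300MB, to stay well under Heroku's 500MB slug limit

// Before writing a new cached file, wipe the cache directory if it has grown past
// the limit; nothing is lost because every file still lives in GridFS.
export function evictImageCacheIfTooLarge(): void {
  if (!fs.existsSync(CACHE_DIR)) return;
  const totalBytes = fs
    .readdirSync(CACHE_DIR)
    .reduce((sum, name) => sum + fs.statSync(path.join(CACHE_DIR, name)).size, 0);
  if (totalBytes > MAX_CACHE_BYTES) {
    fs.rmSync(CACHE_DIR, { recursive: true, force: true });
    fs.mkdirSync(CACHE_DIR);
  }
}
```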
I agree that we should not use the local file system, however I meant a non-cloud Mongo installation, such as an on-premises solution.
- We have killed the Palisadoes Heroku instance because of maintainability concerns. I didn't realize that Heroku files were ephemeral. This is good to know and a benefit of making that decision and proposing this one.
So it sounds like Heroku is best for temporary caching of content objects.
@palisadoes So, you want the static files to stay close to the api server?
@palisadoes @xoldyckk I had already written up all the reasons why storing image files on the local API server may not be a good idea here: https://github.com/PalisadoesFoundation/talawa-api/issues/952#issuecomment-1413395447. There I had also mentioned the ephemeral disk for Heroku, but the thing is that most cloud deployers have an ephemeral disk along with a slug size limit. Having a non-cloud solution for this would be tough to implement, presuming there is one.
> 4. There is of course the event of some idiot pushing code into the main branch that could delete these files' directory. Source - [urgent bugFix #948](https://github.com/PalisadoesFoundation/talawa-api/pull/948) (lol). The files would be deleted without any backup.
The purpose of testing is to catch these types of errors. We all make mistakes. The identification of the error was swift as was the recovery. Making the error a second time would be truly careless. Let's maintain our community values of a welcoming environment.
@palisadoes yeah that point was only meant as a joke lol
I just realized that it was self deprecating. Sorry.
@palisadoes she was calling herself an idiot. I think you took it the wrong way.
> I agree that we should not use the local file system, however I meant a non-cloud Mongo installation, such as an on-premises solution.
> - We have killed the Palisadoes Heroku instance because of maintainability concerns. I didn't realize that Heroku files were ephemeral. This is good to know and a benefit of making that decision and proposing this one.
> So it sounds like Heroku is best for temporary caching of content objects.

> @palisadoes So, you want the static files to stay close to the api server?
@palisadoes this is what you meant right?
@xoldyckk I hope I cleared up your doubt on point 4 in the previous comment
> @palisadoes So, you want the static files to stay close to the api server?
> @palisadoes this is what you meant right?
We have to consider two types of users: developers setting up on their local system, and administrators who may or may not want to do that.
@palisadoes It's not about the benefits. It's about having permanent storage (like S3) which holds a backup of all the static media files uploaded by talawa users/orgs. Heroku doesn't provide a solution like that. So, if you want an on-premises solution you have to move to a different cloud provider which does provide it. If you're hosting on Heroku you have to use a remote storage solution from some other provider.
Unless the org admin does not consider having a backup of their static files a benefit and only wants to store their static files temporarily, wiped each day?
And of course there's the issue of the 500MB limit on Heroku.
Yes, @palisadoes, let me elaborate on both of the points that you raised. GridFS just creates two extra collections, fs.chunks and fs.files, inside the same MongoDB database, whether that is a local installation or a hosted one. So the initial setup / configuration is practically unchanged.
We can discuss this on our call tomorrow. I lack the background, but really want to understand the long term implications. I think there are unspoken assumptions that a verbal discussion will rectify.
@palisadoes would it be okay if I participate in that verbal call too? I have explored a lot of alternative solutions to tackle this problem, and how to go about implementing GridFS, if it can be implemented at all.
@kb-0311 Of course it'll be you, it's your idea. Besides, I'm only providing input; I don't exactly know the nitty-gritty details of the implementation.
Thanks a lot, Sir.
Umm, btw I am relatively new here, how do verbal calls in the organization work exactly? Like a Google Meet or something? And how will we be able to join?
@kb-0311 Please create a channel on slack so that we can coordinate.
@kb-0311, @xoldyckk
What would you consider basic functionality that could be implemented within a week? I'd greatly prefer to have this done before the GSoC organization announcements.
@palisadoes Yes, I will get the basic file uploading, storage, and deletion in the database done for the contributors next. But as we discussed earlier on the call, I am a little preoccupied with my university exams, which end on 15th Feb, so until then I will be working on this locally little by little, since it will require complete attention on my part to avoid causing breaking changes.
However, I do promise that as soon as my exams are over on 15th Feb, I will dedicate most of my time to the GridFS implementation. GSoC org announcements are on 22nd Feb, and I will try to complete at least the basic image object handling by that time. Until then, before the 15th, I will be submitting PRs for more minor issues here and there.
@palisadoes Progress (1/4): In the call you mentioned there needs to be a method to make sure the exact same binary object does not get uploaded again and again, taking up space. It turns out there is already a sort of solution for that in this file -> https://github.com/PalisadoesFoundation/talawa-api/blob/develop/src/utilities/reuploadDuplicateCheck.ts
Basically, if I am not wrong, we are hashing the binary of an image and storing the hash value in the ImageHash model, and then, instead of reuploading the same file again and again, we make sure the image field points to the already uploaded binary object in the database.
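If I'm reading it right, the core of that check is roughly the following (a simplified sketch, not the exact contents of the file; the ImageHash fields shown here are assumptions):

```ts
import crypto from "crypto";
import { ImageHash } from "../models"; // assumed model with { hashValue, fileName, numberOfUses }

// Hash the uploaded image's binary and, if the same hash already exists, point the
// image field at the file that is already stored instead of writing a duplicate.
export async function reuploadDuplicateCheck(
  buffer: Buffer,
  fileName: string
): Promise<string> {
  const hashValue = crypto.createHash("sha256").update(buffer).digest("hex");

  const existing = await ImageHash.findOne({ hashValue });
  if (existing) {
    await ImageHash.updateOne({ hashValue }, { $inc: { numberOfUses: 1 } });
    return existing.fileName; // reuse the already-uploaded object
  }

  await ImageHash.create({ hashValue, fileName, numberOfUses: 1 });
  return fileName; // first upload of this binary
}
```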
This is a good utility in my opinion. What do you think?
OK, we can reuse this
@palisadoes @xoldyckk Good news: my exams are over and I am working on this issue full time now.
Bad news: the current apollo server implementation in the project is wholly deprecated (by that I mean even if we upgrade it to v3, that version is itself going to be deprecated in October this year), leaving the graphql-upload functionality (the basis of file uploads in our server) open to CSRF attacks. In the current project we are using apollo server v2, which is open to CSRF attacks even WITHOUT file uploads integrated, as it has a very old, deprecated version of the graphql-upload package integrated by default, leaving our production server exposed to CSRF attacks at any time.
https://www.apollographql.com/blog/graphql/file-uploads/with-apollo-server-2-0/
In order to completely prevent CSRF attacks and make the file uploads functional we will first need to ->
1. Migrate to a newer apollo server version and install the graphql-upload package individually.
2. Add the csrfPrevention: true config to the apollo server constructor and set the Upload scalar with the graphql-upload file.
3. Set the Apollo-Require-Preflight header on the client side too.
Which begs some questions: the migration is needed for graphql-upload to work, so is there any reason for this delay? Sources:
https://www.apollographql.com/docs/apollo-server/previous-versions/#apollo-server-2
https://github.com/apollographql/apollo-server/security/advisories/GHSA-2p3c-p3qw-69r4
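For reference, a hedged sketch of what steps 1-3 could look like on the server side, assuming apollo-server-express v3.7+ (where csrfPrevention is available) and a graphql-upload version that still exposes root named exports (newer versions require deep imports); this is not the actual talawa-api schema:

```ts
import express from "express";
import { ApolloServer } from "apollo-server-express";
import { graphqlUploadExpress, GraphQLUpload } from "graphql-upload";

const typeDefs = `
  scalar Upload
  type Query { _empty: String }
  type Mutation { uploadImage(file: Upload!): Boolean }
`;

const resolvers = {
  Upload: GraphQLUpload, // wire the Upload scalar to graphql-upload's implementation
  Mutation: {
    uploadImage: async (_parent: unknown, args: { file: Promise<any> }) => {
      const { createReadStream, filename } = await args.file;
      console.log("received upload:", filename);
      // ...stream the upload into GridFS / S3 here via createReadStream().
      return true;
    },
  },
};

async function startServer(): Promise<void> {
  const app = express();
  app.use(graphqlUploadExpress()); // parse multipart upload requests before Apollo

  const server = new ApolloServer({
    typeDefs,
    resolvers,
    csrfPrevention: true, // blocks requests that could bypass a CORS preflight
  });
  await server.start();
  server.applyMiddleware({ app });
  app.listen(4000);
}

startServer();
```

On the client side, upload requests would additionally need to carry the Apollo-Require-Preflight header, as noted in step 3.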
@EshaanAgg how long would it take for you to submit a PR for the migration? If you are busy with other work / exams or stuff, would it be okay with you if I do the migration instead? In exchange, you can write tests for the requestTracing.ts file (an issue I was assigned).
@xoldyckk your thoughts on migrations?
@kb-0311 If you ask me, I'd say we should completely ditch apollo-server and instead make use of graphql-yoga. It's built on web standards, supports all modern features, is less bloated compared to apollo-server, keeps itself minimal, and lets the user choose what they want to use.
It's also officially maintained by the-guild. They're the ones responsible for maintaining most of the well known packages in javascript graphql ecosystem.
With graphql-yoga you can make use of a plugin architecture to make use of the plethora of envelop graphql plugins to integrate different functionalities in an easy way.
Migration from apollo-server to graphql-yoga is also very easy. Look at this. Not much to do at all. The biggest feature of apollo is apollo-federation. It supports that too out of the box.
Basically when using tools from this ecosystem you can rest assured that you're following the current industry standard and adding new functionalities won't break the app.
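To make the comparison concrete, a minimal graphql-yoga setup looks something like this (placeholder schema, not talawa-api's):

```ts
import { createServer } from "node:http";
import { createYoga, createSchema } from "graphql-yoga";

const yoga = createYoga({
  schema: createSchema({
    typeDefs: /* GraphQL */ `
      type Query {
        hello: String!
      }
    `,
    resolvers: {
      Query: { hello: () => "world" },
    },
  }),
});

// Yoga is just a request handler, so it plugs straight into Node's http server
// (or express, fastify, etc.) without framework-specific glue.
createServer(yoga).listen(4000, () => {
  console.log("GraphQL server running on http://localhost:4000/graphql");
});
```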
I'd also like to say we should replace many other packages used by talawa-api to their modern counterparts which have active support, work better with typescript, are more performant etc.
@xoldyckk I agree with you; however, there is soon going to be a long list of changes to make to the project dependencies, as discussed in other threads (Prisma ORM, enforcing GraphQL patterns in queries etc.), which probably need to be done one by one.
Sure, the migration part is easy, but it does not support batching. Also, to perform functions available in apollo server it requires external packages as plugins (e.g. for CSRF protection it requires another plugin). Not saying it is not worth it, but I doubt I will be able to complete it before the 22nd.
Secondly, since we want to migrate to apollo v4 at least in the short term to make sure image upload/delete etc. are functional on the client side, I think that for now we should make minimal changes to the dependencies. So migrating to v4 does seem like the more feasible option, at least for the short term.
@xoldyckk @kb-0311
I had started working on the migration from apollo-server-2 to apollo-server-3, but found there were many breaking changes, which required refactoring the same. In view of this, @palisadoes decided to close the issue and keep it as a GSoC proposal idea.
In view of using graphql-yoga, I am personally in complete support of the same. But before making this switch, I feel that we should leave this discussion open for a couple of days so that other active developers can also drop their views about the same. We will also need additional review on whether we should migrate the project on a couple-of-PRs-that-migrate-the-whole-codebase basis, or as a complete project in itself.
Testing of image utils should only be worked upon once we arrive at a stable server for talawa-api.
@EshaanAgg @palisadoes duly noted. So for this PR I will at least make sure the underlying API logic for the GridFS integration is done.
The client side integration will of course come after the migration.
@EshaanAgg I take it you will be working on the migration project for GSoC? How long do you think it will take you to complete it?
@palisadoes could you please add the apollo server migration idea to talawa-docs as a GSoC idea this year? Thanks.
> @EshaanAgg I take it you will be working on the migration project for GSoC? How long do you think it will take you to complete it?
I would definitely try to take it up during that time! As to comment on the timeline for the same, I'll need to read the documentation of both the current and the proposed server in detail, which is something I'll get to most probably in a week's time (as I have my semester exams going on right now).
It does support batching, it's just not enabled by default since batching is a bad practice.
CSRF protection is a one-liner to integrate.
Schema directives (like @auth) have a new implementation in apollo-server 4. They're not the same as talawa-api currently uses. They make use of graphql-tools, which again is another package maintained by The Guild. So you will have to code it from scratch if you're gonna use apollo-server 4.
Btw I think you didn't go through their docs properly so you don't see the value it would provide. There are many other useful features it provides. One important one is making testing easy. Check this. Another important one is how easily it works with typescript compared to apollo-server.
Whatever effort it'll take to migrate to apollo-server 4 the same will apply to graphql-yoga too. So, if you're going to put effort in something you should choose wisely and think of the future prospects.
Most of the graphql related code lives in resolvers and typeDefs which doesn't depend on what server implementation you use for graphql. So, they won't need to be changed.
I personally think it's a bad idea to include upgrading packages and dependencies as a GSOC idea. This is something which should be continually maintained in the project instead of remaining stale and waiting for a certain event(like GSOC) to be fixed. And the time right before GSOC is a perfect time to do this.
This will only create friction between features implemented by different contributors as there will be version dependency conflicts. If you have the time for it right now, why wait for later?
@xoldyckk I too would prefer that migrations would be done before the GSoC period starts so that this image thingy could be implemented before that, no point in delaying it.
However, I think it will take a little more time than what was previously agreed (22nd Feb).
The aim of upgrading packages as part of GSoC was to make sure that they were all up to date by the end of the period. They would then be kept current with dependabot.
It wasn't the intention to stop upgrades completely until then.
If there was no clear path to doing this upgrade, then it should be done as an idea.
I'll update the wording of the idea.
Describe the bug: Currently the image file handling utilities for the application are not the best approach to solve the problem.
As Specified in this comment -> https://github.com/PalisadoesFoundation/talawa-api/pull/948#issuecomment-1407703633
I think, instead of whatever we are doing now, the best possible approach would be to:
1. Convert the client side image to a binary string and then make a mutation to the API using that URI as a string.
2. In the API, update the object's (Post, User, Organization) image field with that binary URI.
3. Whenever we need to fetch or render images, we can just do it by converting the Buffer binary data string from the object's image field.
This way:
- We really reduce the size of the data, because we are not sending and retrieving MBs of png files back and forth between the API and the client.
- It fixes the legacy flaw that this whole fiasco revealed.
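A tiny sketch of steps 1-2 on the API side (the helper and field names are illustrative only):

```ts
// Convert the uploaded binary into a base64 data URI string so it can be stored
// directly on the document's image field; no separate file retrieval is needed.
export function toImageDataUri(buffer: Buffer, mimeType = "image/png"): string {
  return `data:${mimeType};base64,${buffer.toString("base64")}`;
}

// e.g. inside a resolver (hypothetical model and field names):
//   await User.updateOne({ _id: userId }, { image: toImageDataUri(fileBuffer) });
```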
I will need to change the following files ->
src/utilities/uploadImage.ts
src/utilities/reuploadDuplicateCheck.ts
src/utilities/deleteImage.ts
src/utilities/imageExtensionCheck.ts
src/utilities/imageAlreadyInDbCheck.ts
src/utilities/deleteDuplicatedImage.ts
in the API project, and also make changes in the talawa-admin and talawa mobile app repos so that the client side can be updated as well.