PalisadoesFoundation / talawa-api

API Backend for the Talawa Mobile App. Click on the link below to see our documentation
https://docs.talawa.io/
GNU General Public License v3.0

No Input Validation in any of the form fields #1064

Closed anshgoyalevil closed 1 year ago

anshgoyalevil commented 1 year ago

Describe the bug There is no input validation in the resolvers, so the following invalid inputs are accepted:

To Reproduce Steps to reproduce the behavior:

  1. Go to Talawa-Admin
  2. Try creating an account with a very easy password like cool
  3. Try creating an org with 6000+ words in its name, location, tag, etc.
  4. Try creating an event with end date before the start date, and do the same as point 3.
  5. Try creating a post with things as in point 3.

Expected behavior Errors should be thrown for such behavior.

Actual behavior No error is thrown.

Screenshots (two screenshots attached)

kb-0311 commented 1 year ago

@anshgoyalevil What do you think would be the best approach here? Are you going to limit the number of characters by (a) rejecting the request as a whole whenever a field exceeds the maximum length, or (b) storing only the characters up to the maximum length?

kb-0311 commented 1 year ago

We will need to discuss this, because this problem is not limited to the Post or Event mutations. It extends to every mutation where we take a string as input. For example, in the signUp security bug I was working on, I noticed I could technically submit an extremely long string.

Could you open a parent issue for this, so that child issues can be assigned to others?

anshgoyalevil commented 1 year ago

> @anshgoyalevil What do you think would be the best approach here? Are you going to limit the number of characters by (a) rejecting the request as a whole whenever a field exceeds the maximum length, or (b) storing only the characters up to the maximum length?

I would go with limiting the number of characters (rejecting the whole request), since storing only the characters up to the maximum length would create a data discrepancy.

anshgoyalevil commented 1 year ago

> We will need to discuss this, because this problem is not limited to the Post or Event mutations. It extends to every mutation where we take a string as input. For example, in the signUp security bug I was working on, I noticed I could technically submit an extremely long string.
>
> Could you open a parent issue for this, so that child issues can be assigned to others?

Yes, I already mentioned that there is no validation in any of the form fields, including the Organization, Post, Event, and User creation forms.

I don't think this issue needs child issues, since I am planning a strategy that applies the validation logic to the whole input base, with a maximum string length of 255 characters. There are also other minor conditional checks which need to be added to the mutation resolvers.

If I need any help from you, or if it looks like the codebase is going to change a lot, I will surely ask to divide it into child issues.
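
The per-field checks being discussed could look roughly like the following. This is a minimal sketch, not talawa-api code: the helper names, the `createEvent` shape, and the 255-character cap are taken from the discussion above as assumptions.

```typescript
// Hypothetical validation helpers; names and the 255-character cap
// follow this discussion, not the actual talawa-api codebase.
const MAX_FIELD_LENGTH = 255;

function validateLength(fieldName: string, value: string): void {
  if (value.length > MAX_FIELD_LENGTH) {
    // Reject the whole request rather than truncating,
    // to avoid the data discrepancy mentioned above.
    throw new Error(`${fieldName} must not exceed ${MAX_FIELD_LENGTH} characters`);
  }
}

function validateDateRange(startDate: Date, endDate: Date): void {
  if (endDate.getTime() < startDate.getTime()) {
    throw new Error("End date must not be earlier than start date");
  }
}

// Illustrative resolver-style entry point that runs the checks first.
function createEvent(args: { title: string; startDate: Date; endDate: Date }) {
  validateLength("title", args.title);
  validateDateRange(args.startDate, args.endDate);
  return { ...args }; // persist only after validation passes
}
```

The same two helpers would cover the Organization, Post, and User creation forms as well, since they all reduce to length caps plus a few conditional checks.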

kb-0311 commented 1 year ago

@anshgoyalevil sounds good to me

xoldd commented 1 year ago

@anshgoyalevil @kb-0311 Here's something I wrote a few months ago; it should give you some more insight: https://docs.google.com/document/d/1BgD354MQukK6pCJ9Xgfpzx5L1o-qq4m19odgfXXJRkU/edit?usp=sharing

anshgoyalevil commented 1 year ago

> @anshgoyalevil @kb-0311 Here's something I wrote a few months ago; it should give you some more insight: https://docs.google.com/document/d/1BgD354MQukK6pCJ9Xgfpzx5L1o-qq4m19odgfXXJRkU/edit?usp=sharing

Thank you @xoldyckk, I will read it in depth.

kb-0311 commented 1 year ago

@xoldyckk It seems you mentioned the getSort() helper I sent a PR for a while back. Now that it has been implemented, I guess we can remove that as a problem from your watchlist, along with several others.

Thanks for this, by the way; it really helps with compartmentalizing the problems.

xoldd commented 1 year ago

@kb-0311 The getSort functionality was within the resolvers. I separated it out into functions because it was cluttering the resolver logic for no reason.

By the way, the current implementation of getSort is not good enough, as it is not type safe.

xoldd commented 1 year ago

Talawa-api also uses GraphQL in an anti-pattern way. There should not be so many query fields for resolving connection objects; those connection objects should be nested as relations on their corresponding parent objects.

Everything should be a single graph. To avoid useless computation on parent objects when you only want to fetch a deeply nested child object, you can introduce the node interface pattern.
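
The node interface pattern mentioned here can be sketched in schema language roughly like this (the Post type and its fields are illustrative, not from the talawa-api schema):

```graphql
interface Node {
  # A globally unique id shared by every fetchable object in the graph.
  id: ID!
}

type Post implements Node {
  id: ID!
  text: String!
}

type Query {
  # Fetch any object directly by its id, without resolving
  # its parent chain first.
  node(id: ID!): Node
}
```

With this, a client can refetch a deeply nested object by its id alone, so the server never computes the parent objects just to reach the child.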

kb-0311 commented 1 year ago

@xoldyckk So do you mean that, instead of just using mongoose's populate function to get connected nodes, the API should fetch the connected objects (the population fields of the child node that are necessary for a given parent resolver's logic) in a nested manner from the database, in that field's own resolver?

How would we go about making the current getSort() function type safe?

xoldd commented 1 year ago

@kb-0311 Yes, that's how GraphQL is supposed to work. If you return everything from the root resolver, you're just wasting resources.

Example:

```graphql
type User {
  userName: String
  posts: [Post]
  followers: [User]
  followees: [User]
}
```

With the current approach, if the client asks for the userName field, your resolver also queries the database for all the posts, followers, and followees, which are expensive fields to query. If the client only needs the userName field, which is present on the base object, why also fetch the other expensive fields? With nested resolvers for those expensive fields, GraphQL would only execute them when the client explicitly asks for those fields.
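
The difference can be sketched with a plain resolver map. This is a sketch, not talawa-api code: the `db` object and its functions are stand-ins for real database queries.

```typescript
// Placeholder data access; stands in for real database queries.
const db = {
  getUser: (id: string) => ({ id, userName: "alice" }),
  getPostsByAuthor: (id: string) => [{ id: "p1", authorId: id }],
};

// Anti-pattern: the root resolver eagerly fetches everything.
const eagerResolvers = {
  Query: {
    user: (_: unknown, { id }: { id: string }) => ({
      ...db.getUser(id),
      posts: db.getPostsByAuthor(id), // paid for even if the client never asks
    }),
  },
};

// Preferred: expensive fields get their own nested resolver, which
// the GraphQL executor only calls when the client selects that field.
const lazyResolvers = {
  Query: {
    user: (_: unknown, { id }: { id: string }) => db.getUser(id),
  },
  User: {
    posts: (parent: { id: string }) => db.getPostsByAuthor(parent.id),
  },
};
```

Both maps plug into any resolver-map-based GraphQL server; only the lazy shape lets a `{ user { userName } }` query skip the posts fetch entirely.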

xoldd commented 1 year ago

@kb-0311 getSort can be made type safe according to the sorting parameters mongoose accepts. Though mongoose itself is not type safe, so it doesn't really matter, lol. It will accept anything you pass as the sort object and only throw errors at runtime, which goes against the philosophy of TypeScript, which is to catch errors at compile time.
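
One way to constrain getSort is to tie the accepted sort keys to the document type at compile time. A sketch, assuming the helper just builds the object handed to mongoose's `.sort()`; the `getSort` below is a stand-in, not the actual talawa-api helper.

```typescript
// Sort direction values of the kind mongoose accepts in a sort object.
type SortOrder = 1 | -1 | "asc" | "desc";

// Only keys of T may appear as sort fields; anything else fails to compile.
type SortSpec<T> = Partial<Record<keyof T, SortOrder>>;

interface Post {
  title: string;
  createdAt: Date;
}

// Stand-in for getSort: the type parameter does the validation at
// compile time; at runtime it just returns the spec for .sort().
function getSort<T>(spec: SortSpec<T>): SortSpec<T> {
  return spec;
}

const sort = getSort<Post>({ createdAt: "desc" });
// getSort<Post>({ likes: 1 });  // would not compile: 'likes' is not a key of Post
```

This only guards the call site; as noted above, mongoose itself would still accept an arbitrary object, so the runtime behavior is unchanged.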

kb-0311 commented 1 year ago

@xoldyckk Got it :+1: I have been thinking about that too: why are we using the populate function so much? For example:

[Screenshot from 2023-02-12 showing a resolver making heavy use of populate]

It does seem that refactoring the codebase into nested queries could be a really good idea.

So basically, since mongoose's sort field is itself not type-safe, there is no way to consistently make sure the sorting parameter is type safe? lol.

xoldd commented 1 year ago

@kb-0311 Returning everything in the root resolver is a GraphQL anti-pattern, but that's exactly what MongoDB is good at. It is bad at querying relational data and extremely slow at it, because it is designed purely to store related data within the same document as the parent: you just embed the data inside it. This is unlike relational databases, which have separate tables for each data object, connected through foreign keys.

xoldd commented 1 year ago

But if you embed data, you'll have problems later on if you want to perform operations on that embedded data, or if that embedded data is related to other objects. You can also have duplicate objects in MongoDB: the same object, with the same data, embedded in different documents.

You see where I'm getting at. Managing it is a headache, and it's not designed to work with relational data. It's good for just dumping data that needs to be fed to some Artificial Intelligence system or something. It is not ACID compliant. You think you deleted some object? No, it is embedded in 20 other objects as well. And not only one level of embedding: you can have many levels of nesting and embedding.

Add to all this the fact that you have no type safety in code. You don't know what object references what, what's embedded where, how many places you'd need to mutate it if you wanted to, etc.

kb-0311 commented 1 year ago

Ohhhkkk I got what you are trying to say.

So basically, because we are using MongoDB, there are performance downsides. That explains why we need to use a forEach loop in our resolvers to delete the related documents when deleting a parent document.

kb-0311 commented 1 year ago

So what do you suggest as a possible solution?

xoldd commented 1 year ago

@kb-0311 MySQL or Postgres. For a hosted service, PlanetScale is the best right now. For an ORM, there's Prisma: the best on the market, designed from the ground up for a TypeScript-first experience.

kb-0311 commented 1 year ago

I would recommend Postgres too.

While I was learning GraphQL I was using Postgres as well: https://github.com/kb-0311/civilised-forum-and-notices. lol.

I highly doubt we will be migrating the database anytime soon, though.

EshaanAgg commented 1 year ago

I myself would highly advocate for Postgres and Prisma. I have worked with both of them before, and it's been one of my favourite backend development experiences to date. Another benefit of using relational databases: you get detailed table structures and relation diagrams that help beginners understand the data and how it works. With MongoDB, you have to rely heavily on intuition for the same.

palisadoes commented 1 year ago
  1. Is there room for a hybrid approach? Using a relational DB with an ORM for the relational data and MongoDB for other areas?
  2. If so, what would the high level design be? How could this be done with minimal disruption?
aviraldevv commented 1 year ago

@palisadoes There's always room for a hybrid approach, provided we are ready to handle the issues related to complexity and scalability.

To implement it:

  1. We need to sort out which data is suited for a relational database and which for a non-relational one.
  2. We can use an ORM such as Prisma to interact with PostgreSQL and handle the relational data.
  3. We can use MongoDB for the semi-structured data.
  4. We then have to integrate the two databases and use Prisma's API to access both.
aviraldevv commented 1 year ago

We can try this out on a small, isolated part of our system and see how things work out.

anshgoyalevil commented 1 year ago

@palisadoes Please assign this issue to me.

aviraldevv commented 1 year ago

@palisadoes @xoldyckk what do you think about this??

xoldd commented 1 year ago

@palisadoes @aviraldevv @anshgoyalevil @EshaanAgg @kb-0311

There's not a single data model in talawa-api that is non-relational. Non-relational data doesn't exist in applications almost 99% of the time.

First, let me tell you this: MongoDB is perfectly fine unless your app is scaling to millions of users and you need fast operations, or you need to resolve extremely complex relations along with complex sorting and filtering logic. It's the mongoose package that is bad, because it has poor integration with TypeScript. At the very least, that package should be replaced with Prisma.

Prisma has official support for MongoDB, and it is built from the ground up to cater to TypeScript and a good developer experience. Also, the way many models are mapped in the mongoose models is bad, design-wise, for relational data and type safety. They should be mapped in a normalized way, like relational database models. There should be no embedding of documents, and there should be strict enforcement of the types and expected values of model fields according to the business logic.

Now, if you want to know about integrating SQL databases, read below:

The reason I and many other people are recommending a relational database is the performance, safety, reliability, and ACID capabilities they provide. Scalability is not an issue, because it's managed by whichever DBaaS the user chooses.

There are no use cases where MongoDB would provide value over a SQL database, except for the native GridFS implementation it provides for file storage. Even that only works with hosting providers that support GridFS in their own MongoDB offering, and its way of storing files is completely different from object storage providers like S3, Google Cloud Storage, etc. The same is true for the functional logic required for sending and receiving files through it.

If we go with relational databases like MySQL or Postgres, we'll have to use a separate object storage provider such as S3 or Google Cloud Storage for file storage. This is not an anti-pattern; it is the industry standard and convention for storing static files.

Also, there's no scenario where a person using the Talawa packages can get away without providing their own configuration for a database and storage provider. We're not hosting it for them. Talawa-api should be made integrable with different database providers and storage solutions (at least the popular ones). This functionality would be available through plugins.

Mostly, plugins would only be required to integrate different storage providers, because the database is usually integrated through a DATABASE_URL environment variable, and talawa-api wouldn't care where or how the database is hosted. Though some recommendations for good database providers can be provided in talawa-docs.

palisadoes commented 1 year ago

I figured that would be the answer. If we were to go all-RDBMS, should that be a 2023 or a 2024 project?

  1. We have a lot of GSoC ideas for 2023 that could create an unusable application if the integrations fail.
  2. It may be worthwhile to consider the change for 2024 and have all ideas focus on the migration.
xoldd commented 1 year ago

@palisadoes I guess we could make a separate branch for all the talawa-api work to be done for GSoC 2023. The mobile and admin apps would continue to use the older version of talawa-api, which would stay on the develop branch.

xoldd commented 1 year ago

I'll think about it and tell you later.

EshaanAgg commented 1 year ago

@palisadoes @xoldyckk @anshgoyalevil @kb-0311 @aviraldevv My two cents:

  1. We must shift to an RDBMS as soon as possible, for the following reasons:

    • The data being stored currently is highly structured and relational by its very nature, so using a NoSQL database for it goes against all the design principles.
    • NoSQL databases, as a rule of thumb, should be used when the structure of the data is not known or is highly variable. In our case, the shape of all our data is well known and MUST BE ENFORCED. It therefore makes much more sense for us as developers to outsource the job of enforcing constraints and maintaining the shape of the data to an ORM or database service, rather than implementing all these checks ourselves.
    • NoSQL databases have been designed for applications with high writes and fewer reads. The Talawa ecosystem, on the other hand, does more reads than writes, so a SQL database is the obvious choice.
    • The longer we wait before making the switch, the more functionality will have been implemented on top of the NoSQL database, making migration ever more tiresome.
    • The integration and auto-completion we will get with Prisma and a relational database are phenomenal. They will help make the codebase simpler and more intuitive for those starting out.
  2. Switching to Prisma is a big overhaul, but in the process we have to make the relations and data structures explicit. While designing them, we can work in parallel on updating the Talawa-Docs documentation about these models and their constraints, which is something we severely lack.

  3. Instead of using a separate branch, I would personally recommend creating a new repository. The reasons include:

    • Nearly everything in this repo needs to be refactored (even the GitHub Actions and instructions). Starting from a blank slate would be much easier and won't force us to fight our own code.
    • Instead of having a branch where nothing works, we would build the new API incrementally, with small aspects working and testable at each step. The existing talawa-api would act as the guiding source of truth for developers.
    • It would allow for easier collaboration, as we can brainstorm the steps of the migration (defining schemas, defining their relations, creating appropriate indexes, writing resolvers).
  4. Working with file storage solutions and using different plugins to support different providers should be our last concern.

palisadoes commented 1 year ago

@DMills27 @JamaicanFriedChicken What are your thoughts?

DMills27 commented 1 year ago

@EshaanAgg I'm not convinced by your arguments that a SQL database would be absolutely necessary here. The biggest issue with introducing a SQL database would be the cost overhead of the vertical scaling that SQL databases require.

• Regarding your first point: while there are a number of schemas that suggest a relational nature, it could be the case that these schemas could be "trimmed down" or refactored to make them "less relational". What I mean is that not all of the relations in our schemas are used to best effect. For instance, Facebook uses a NoSQL database (after switching from a SQL database in its formative years) to manage its social media site, and you could argue that many things on that site are "highly relational in nature".

• Regarding your second point: you mention the "enforcing of constraints and maintaining shape of the data", but are you aware that MongoDB has actually been ACID compliant for multi-document transactions since MongoDB 4.0 was released in 2018? (https://www.mongodb.com/transactions)

• Regarding your third point: you state that "NoSQL databases have been designed for applications with high writes, and less reads", but this isn't actually true. In general, NoSQL databases have faster read and write speeds than their SQL counterparts. See https://www.cs.rochester.edu/courses/261/fall2017/termpaper/submissions/06/Paper.pdf

As it relates to the rest of your comments, I think introducing a cache such as Redis may allay the concerns brought up here. I've never used Prisma, but I've heard good things about it, so I can't speak at length on it. One could separate the "relational" aspects of the GraphQL schema and then interact with Redis to produce the desired results in an efficient manner.

anshgoyalevil commented 1 year ago

@palisadoes @xoldyckk @kb-0311 @aviraldevv @EshaanAgg @DMills27

My thoughts on this:

  1. We can consider using Sequelize if we want to use an RDBMS, since its syntax is very similar to Mongoose's and it also has TypeScript support.
  2. If we still want to use the current database, i.e., MongoDB, and want TypeScript support, we can consider using Prisma, as suggested by others above.
  3. What I would suggest is that we should not shift the database to SQL for now.
  4. We should consider using Redis for caching, as suggested by @DMills27.
kb-0311 commented 1 year ago

@palisadoes @DMills27 @xoldyckk @anshgoyalevil @aviraldevv

I don't think we will be migrating databases any time soon, nor introducing Prisma soon. However, that could be an idea for GSoC '24 (maybe).

Regarding the computational cost of queries:

  1. As @xoldyckk pointed out, the current GraphQL implementation is an anti-pattern, and thus each query is computationally expensive. The best approach, as suggested by @DMills27, is caching with Redis. I would suggest batching large queries as well as caching them.
  2. I had previously shown interest in this year's GSoC idea (API: Improved Performance and Security) and discussed it with other contributors like @xoldyckk and @anshgoyalevil, and of course with @palisadoes. My main proposal for implementing improved performance was exactly this, caching and batching queries:
     a. If we keep MongoDB and keep making queries in the root resolvers, the best approach in my opinion would be to divide our resolvers into one layer that makes the database queries and another that caches and batches the results of those queries (until a mutation alters the results of a query, at which point the data will of course be refetched to avoid staleness).
     b. @DMills27 Instead of Redis only, I was thinking of using something like DataLoader (https://github.com/graphql/dataloader). It is a utility maintained by the GraphQL project itself, and it is widely used for this exact use case.

To summarize: if we are delaying the database migration or the Prisma ORM, then maybe this year we can implement this.
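
The batching idea behind DataLoader can be sketched without the library itself. The following is a minimal stand-in, not DataLoader's actual implementation: it collects all keys requested within one microtask tick and resolves them with a single batched fetch, where the batch function here is a placeholder for one database query.

```typescript
// Minimal DataLoader-style batcher: keys requested within the same
// microtask tick are queued and resolved with one batch call.
class TinyLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void }[] = [];
  private scheduled = false;

  constructor(private batchFn: (keys: K[]) => Promise<V[]>) {}

  load(key: K): Promise<V> {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      if (!this.scheduled) {
        this.scheduled = true;
        queueMicrotask(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    // One batched fetch for every key requested in this tick.
    const values = await this.batchFn(batch.map((item) => item.key));
    batch.forEach((item, i) => item.resolve(values[i]));
  }
}

// Usage: two loads in the same tick trigger a single batched "query".
let batchCalls = 0;
const userLoader = new TinyLoader<string, string>(async (ids) => {
  batchCalls += 1; // in a real resolver this would be one database query
  return ids.map((id) => `user:${id}`);
});
```

The real DataLoader adds per-request caching on top of this batching, which is what makes it a fit for the caching-plus-batching approach described above.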

palisadoes commented 1 year ago

Please remember that those on this thread are focusing on Talawa-API. Talawa-Admin is in a precarious state, and lots of work needs to be done there. Doing simultaneous major redesigns of two interdependent repos would not be wise.

Talawa Admin - 2023

If you have an hour to spare, I strongly suggest you take a look at this video to:

  1. get an idea of the current state of Talawa-Admin,
  2. understand where Talawa-Admin needs to be
  3. become familiar with what community organization management software really is. The video walks through the features. It should be eye-opening, even if you are not interested in Admin.

I urge you to find a way to solve this with:

  1. minimal schema modifications.
  2. maximum performance gains

We still have 2024.

kb-0311 commented 1 year ago

@palisadoes Implementing DataLoader would not require much schema modification. The API routes and resolvers will be similar; only the queries will be cached and, in the case of large queries, batched as well.

I agree that we should not completely redesign the API project this year and should leave that as an upcoming task.

I had seen the call recording; I could not join because it was very early in the morning. I get why the redesign has to be put on hold for a while. There are also a lot of inconsistencies between the client (talawa-admin, the Talawa mobile app) queries/mutations and those that exist in the API.

@anshgoyalevil was working with the admin portal the last time we talked, so maybe he can provide some more input on how to mitigate the inconsistencies in the admin portal. Someone who is well familiar with the Talawa mobile app will also need to be brought into this conversation.

anshgoyalevil commented 1 year ago

@kb-0311 Totally agree with you. There are a lot of things that need to be modified in or added to the Talawa-API, and there are many things in the Talawa-Admin project that act as placeholders. For example, Contributors is present in the menu of the Admin Dashboard but is not implemented in the API, and the Block/Unblock system works very inconsistently.

The Talawa API project needs a lot of work, since there are certain mutation/query resolvers that were supposed to exist according to the Talawa Admin project. One such issue I resolved 2-3 days back was related to this: there was an edit post button inside the Admin Panel, but no corresponding mutation resolver was present in the API.

At the same time, according to the video @palisadoes shared, this is also a good thing, since it gives us an idea of what needs to be implemented in the API by looking at the dummy links in the admin panel.

I have mostly worked on the Talawa API so far, and I have observed that many things are missing. One such feature is Spam Protection, which was reverted last year due to TypeScript migration issues. It is now included in the 2023 idea list, to be implemented under the Service as a Plugin model.

I hope many such issues will be resolved this year, and that the other issues, such as the database migration, can be discussed over a meet next year, i.e., in 2024.

palisadoes commented 1 year ago

We also have big issues on the API side with outdated and deprecated dependencies. The Apollo Server update is a big one.

This will need to be added to our GSoC ideas.