alejmedinajr / AI-Detector

Repo for 2024 CS Capstone Project

Machine Learning Model #49

Closed alejmedinajr closed 5 months ago

alejmedinajr commented 6 months ago

This is the issue for the machine learning model that will be used as an additional form of classification between AI and human content. Work on it can start before the testing data is available; once the testing data is ready, the model can be benchmarked for accuracy.

alejmedinajr commented 6 months ago

Starting this now

alejmedinajr commented 6 months ago

Current update: In place of our dataset, I decided to use something that can be binary classified (for now I settled on a dataset for binary classification of rain in Australia). I have Logistic Regression and Decision Tree models working on this dataset, although the accuracy is pretty bad. I am aiming to have several different classification models that can be compared (after several phases of automated testing) so that we can at least use our best, most highly tuned model. This dataset has a handful of features (columns), and the more I think about it, the features we use in our model could be the different similarity scores produced by all the text comparisons (along with maybe the size of the text?).
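
For reference, this is roughly the shape of what I have so far: a minimal sketch that uses a synthetic binary dataset as a stand-in for the Australia rain CSV (the real script reads the CSV and has different column names).

```python
# Minimal sketch of the two baseline models; a synthetic dataset stands in
# for the Australia rain CSV used in the actual script.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data (stand-in for the rain dataset).
X, y = make_classification(n_samples=2000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)        # train on the 80% split
    preds = model.predict(X_test)      # evaluate on the held-out 20%
    print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}")
```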

alejmedinajr commented 6 months ago

Current Update: I read more about the different ways we can benchmark performance, so the precision score, recall score, F1 score, and ROC AUC are now included as well. These come from scikit-learn, along with the additional models I implemented for this data: Random Forest and SVC (Support Vector Classifier). The newly added models take significantly longer to fine-tune than the first two, which means we may have to drop them if they are also slow on our data. For now I will keep them, since they seem to have better accuracy on this particular dataset.

Right now the data uses an 80/20 train/test split, but the commonly recommended training fraction is anywhere from 60-80%, so this would also need to be tuned. I also read that some people have success with an 80-10-10 split, where 80% is training, 10% validation, and 10% testing; that is obviously a further step. The problem here is not the ML model or even the code for it (there are so many packages); the problem is tuning the model correctly and choosing the right model. That is why my next update will focus on automating the process of tuning the models' parameters. Near the end of this work interval I was looking at efficient ways to tune the hyperparameters, and I found a nice way to automate the tuning phase, which can be found here. I will keep working on automating the tuning phase for the Australia rain data.
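
A minimal sketch of the benchmarking step with the extra metrics and models mentioned above (synthetic data again as a stand-in for the rain dataset):

```python
# Sketch of the benchmarking step: the extra models scored with the
# additional metrics. Model settings and the split ratio are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in {"random_forest": RandomForestClassifier(n_estimators=200),
                    "svc": SVC(probability=True)}.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]   # class probabilities, needed for ROC AUC
    print(name,
          f"acc={accuracy_score(y_test, preds):.3f}",
          f"prec={precision_score(y_test, preds):.3f}",
          f"rec={recall_score(y_test, preds):.3f}",
          f"f1={f1_score(y_test, preds):.3f}",
          f"roc_auc={roc_auc_score(y_test, proba):.3f}")
```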

alejmedinajr commented 6 months ago

Current update: I started experimenting with tuning the hyperparameters using the previously mentioned GridSearchCV, and I also played with the ranges and step sizes of the parameters. I think this explains why the other two models take so long (I ended up commenting them out just to speed things up for myself). I also looked into how I can save a specific model instead of rerunning the whole search again. The purpose of this is to keep the best performing model and just reuse it when new data is available, instead of going through the automated search over all the step sizes every time. I think this would be helpful since our users are most likely not going to wait for that lengthy process, especially because the accuracy gain is not large: a user probably cares about the percentage being >70% rather than about the difference between 85.42% and 87.31%. With that in mind, saving a high-performing model and then reusing it when a new file comes in might be a great alternative.

I found a way to save the models to a file using the joblib library. I then tested this on random rain data with random values (following the same format as the original CSV). It worked (I was able to reuse the previously trained model); however, this was not the best way to test the model's generalization, since the data was not real (random values are not accurate). In the future I will use some of the real data instead. I am at a good stopping point, so I will push the code I have to this branch (though it still needs to be documented).
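
A sketch of the tuning-plus-persistence flow described above (the parameter grid and file name are placeholders):

```python
# Sketch: GridSearchCV to pick hyperparameters, joblib to save the winning
# model so later runs can reload it instead of re-tuning.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"n_estimators": [100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))

# Persist the best model; later runs can skip the grid search entirely.
joblib.dump(search.best_estimator_, "best_model.joblib")
reloaded = joblib.load("best_model.joblib")
print("reloaded accuracy:", reloaded.score(X_test, y_test))
```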

alejmedinajr commented 6 months ago

Starting back on this now that the dataset is in the Google Drive; I can look into testing some of the content there, as well as the text comparisons.

alejmedinajr commented 6 months ago

Current update: I have created a helper function that reads all subdirectories, and the files inside them, from a main directory. I am also in the process of creating a helper function to append strings to a CSV file. The purpose of these is to populate the training data used for the model. Both helper functions will live in parsing.py.
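
A rough sketch of the two helpers; the function names and CSV layout here are hypothetical, not the final interface in parsing.py:

```python
# Rough sketch of the two planned helpers; names and CSV layout are hypothetical.
import csv
import os

def collect_files(root_dir):
    """Return the paths of every file in root_dir and all of its subdirectories."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

def append_row(csv_path, row):
    """Append a single row (list of strings/numbers) to a CSV file."""
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)

# Example usage: record every discovered file in a CSV of training candidates.
for path in collect_files("dataset"):
    append_row("training_data.csv", [path])
```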

alejmedinajr commented 6 months ago

Current update: I found a design issue in how I want to parse the directories. If the dataset is to be built from all the files, there is a problem with my current approach: no comparisons can be made because all of the data (AI and human) sits in the same directory. I think I will move them into separate directories and take the average of all comparisons in order to see whether this is a viable solution or whether things need to be rethought.

alejmedinajr commented 6 months ago

Current update: I ran into an issue reading a PDF. For some reason the PDF is not being read correctly, which does not make sense. The specific problem is a missing end-of-file (EOF) marker. I will look into this, but I am close to being able to compare all of the human files to the AI ones (with the goal of producing training data for the models in models.py).

alejmedinajr commented 6 months ago

Current update: I tried fixing the file-parsing problem using some of the approaches mentioned here: https://github.com/py-pdf/pypdf/issues/480. They did not work, so I am looking into changing the PDF package we are using. At the moment I am looking at pdfplumber.
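
For reference, switching to pdfplumber would look roughly like this (a sketch; the file path is a placeholder):

```python
# Quick sketch of text extraction with pdfplumber as an alternative PDF reader.
import pdfplumber

def read_pdf_text(path):
    """Concatenate the extracted text of every page in the PDF."""
    text_parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:                # extract_text() can return None for empty pages
                text_parts.append(page_text)
    return "\n".join(text_parts)

print(read_pdf_text("sample_submission.pdf")[:500])
```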

alejmedinajr commented 6 months ago

Current update: I am still not able to fix the problem with the missing EOF marker. I tried two other PDF packages, as well as different files from the testing set, to check whether it was one specific file. I think I have spent enough time on this to open a new issue for it.

alejmedinajr commented 6 months ago

I am starting back on this since I found a quick workaround (see #53).

alejmedinajr commented 6 months ago

Current Update: With some slight modifications, I am finally able to use the test files to compare all human solutions with AI solutions and write the results to a CSV file. I am going to add column headers so the CSV is easier to read in terms of the features we want to extract. This is definitely a starting point.

Screenshot 2024-03-10 at 5 25 40 PM
alejmedinajr commented 6 months ago

Current update: I am still working on the formatting of the CSV file. Here is the current version, which is still a work in progress.

Screenshot 2024-03-10 at 6 07 21 PM
alejmedinajr commented 6 months ago

Current update: I ran into a problem where multiple datapoints were repeats of the same datapoint. I fixed this by redoing how I was making the comparisons: I now build one big list so that the AI datapoints can be compared against everything as well.

alejmedinajr commented 6 months ago

Current update: I ended up breaking the existing code (and now the logic is not adding up for me). I think I will revert back to the version from the previous screenshot, since that was closer to what I was looking for than the more recent changes. I will also end the day here and pick this back up later this week.

alejmedinajr commented 6 months ago

Starting back on this

alejmedinajr commented 6 months ago

Current update: I restarted how I was creating the dataset; I am now doing it in a way that I think makes more sense. I realized I had been appending the data inside an inner loop, which was part of the reason the dataset had more rows than it should have.
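
For reference, this is roughly the structure I moved to; the compare() call here is just a stand-in for the real comparison metrics:

```python
# Sketch of the reworked dataset loop: one flat list of labeled documents and
# exactly one row per unique pair. compare() stands in for the real metrics.
from itertools import combinations

def compare(text_a, text_b):
    """Placeholder for the real comparison metrics (fuzz ratios, cosine, etc.)."""
    return [abs(len(text_a) - len(text_b))]

documents = [("human", "first human answer"), ("human", "second human answer"),
             ("ai", "an ai generated answer")]

rows = []
for (label_a, text_a), (label_b, text_b) in combinations(documents, 2):
    features = compare(text_a, text_b)
    rows.append([label_a, label_b] + features)   # appended once per pair, not in a nested inner loop

print(len(rows), "rows for", len(documents), "documents")  # n*(n-1)/2 pairs
```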

alejmedinajr commented 6 months ago

Current update: I now have a closer version of the dataset that was drawn on the chalkboard during class. The next step is to feed this dataset into the machine learning model file. Interestingly, the sequence comparisons always return 0.0 for the first two and 100.0 for the third; I think this is because the third is an approximation algorithm used to compute the other two more quickly. Another likely contributor is that the keywords one would expect in computer science assignments (i.e. for, while, if, class, etc.) are not being removed from the files, and there is leftover whitespace where the newlines used to be. I think this is something we can address during group work; I do not think it would be an issue for assignments outside of CS (i.e. papers/short answers).

Screenshot 2024-03-17 at 3 31 26 PM
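
To help troubleshoot the constant 0.0/100.0 values, here is a small check of difflib's three ratio methods, assuming those are what the sequence-comparison features come from:

```python
# Small check of the three difflib ratios, assuming these back the
# "sequence comparison" features in the dataset.
from difflib import SequenceMatcher

a = "public class Solution { public static void main(String[] args) { } }"
b = "public class Solution { public static int add(int x, int y) { return x + y; } }"

matcher = SequenceMatcher(None, a, b)
print("ratio:           ", round(matcher.ratio() * 100, 1))             # full similarity measure
print("quick_ratio:     ", round(matcher.quick_ratio() * 100, 1))       # faster upper bound
print("real_quick_ratio:", round(matcher.real_quick_ratio() * 100, 1))  # even cheaper upper bound
```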
alejmedinajr commented 6 months ago

Current update: I modified models.py so it can handle data formatted the way we would ideally format it, and it seems like we definitely do not have enough training data. The models are overfitting, which is why they perform impossibly well (100% for three different model types). The logistic model is ballparking ~72% accuracy, which is still a sign that we have very little training data. I know we discussed letting every professor have their own model, with hyperparameters tuned to the assignments they upload, but the lack of data that comes with that would make our models' accuracy very shaky. I think this may be one of the limitations of the project, considering we do not have a large amount of data to work with and creating this data is time consuming. This is where the ability for the user to see the raw values computed for these comparisons comes in handy.

Screenshot 2024-03-17 at 3 42 55 PM
alejmedinajr commented 6 months ago

Current update: I spent more time working with models.py. One of the things I looked at was which of our features are actually meaningful. It turned out the filename was contributing to the model's accuracy, so I removed it, because the filename should not have any impact on the score. Once I removed it, the accuracy of the other three models dropped from 100% (their accuracy still looks like overfitting). I also troubleshot by trying to understand more about the models' behavior: I added code to extract the feature names and coefficients used in the model (only for Logistic Regression so far), and this is what I got:

Screenshot 2024-03-17 at 4 31 26 PM

This image shows the features used to determine the output of the logistic regression model. Each feature's coefficient indicates its impact on the prediction: negative coefficients (e.g. 'Fuzz Ratio', 'Cosine Comparison') suggest that higher values of those features are associated with a lower likelihood of the positive outcome (I believe the positive outcome is AI here, but I need to double-check), while positive coefficients (e.g. 'Fuzz Token Sort Ratio' and 'Fuzz Token Set Ratio') imply that higher values correspond to a higher likelihood of the positive outcome. I also never really explained what the other values mean, so I will explain them so anyone looking at this thread can follow along without prior knowledge of these performance metrics.
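
For reference, this is roughly what the coefficient-extraction code looks like (a minimal sketch; the feature names and training data here are made up for illustration):

```python
# Sketch of the coefficient inspection for the logistic regression model;
# the feature names and the tiny random training set are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["Fuzz Ratio", "Fuzz Token Sort Ratio", "Fuzz Token Set Ratio",
                 "Cosine Comparison", "Text Length"]

rng = np.random.default_rng(0)
X = rng.random((40, len(feature_names)))     # 40 fake datapoints
y = rng.integers(0, 2, size=40)              # fake AI/human labels

model = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has shape (1, n_features) for binary classification.
for name, coef in sorted(zip(feature_names, model.coef_[0]), key=lambda t: t[1]):
    print(f"{name:>24s}: {coef:+.3f}")
```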

alejmedinajr commented 6 months ago

Current Update: I experimented with L1 and L2 regularization, along with the type of solver used for the logistic model. Here is the performance of each configuration (different methods mean different models).

Screenshot 2024-03-17 at 5 01 45 PM

The reason for trying regularization techniques such as L1 and L2 was to prevent overfitting and improve generalization. L1 regularization (aka Lasso) adds a penalty term proportional to the absolute value of the coefficients, which shrinks less important coefficients toward zero. L2 regularization (aka Ridge) adds a penalty term proportional to the square of the coefficients, which penalizes large coefficients and reduces their impact on the model.

Solver methods are just the different optimization algorithms the model can use to fit the classification problem. Here are some of the differences between the solvers used in the image results:

Overall, I just wanted to play around with L1 and L2 regularization, which did not have a big impact (judging from the image). At some point we should talk about the choice between L1 and L2 regularization. For now that can wait; it seems we mainly need more data, so that we can see how more data changes model performance (and whether our features are good).
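
For anyone curious, a minimal sketch of the kind of penalty/solver sweep described above (these configurations are common scikit-learn-supported pairings, not necessarily the exact ones in the screenshot):

```python
# Sketch of the regularization/solver sweep on placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((200, 6))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

configs = [
    ("l1", "liblinear"),   # L1 (lasso-style) penalty needs liblinear or saga
    ("l1", "saga"),
    ("l2", "liblinear"),   # L2 (ridge-style) penalty works with every solver
    ("l2", "lbfgs"),
]

for penalty, solver in configs:
    model = LogisticRegression(penalty=penalty, solver=solver, max_iter=5000)
    model.fit(X_train, y_train)
    print(f"penalty={penalty:<2s} solver={solver:<9s} accuracy={model.score(X_test, y_test):.3f}")
```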

alejmedinajr commented 6 months ago

Current update: I decided to watch this 30-minute video about machine learning with small datasets, since this may be something we have to account for (if we cannot create enough data before the semester ends).

Here are some of my key takeaways/notes while watching the video:

alejmedinajr commented 6 months ago

Current update: I am looking into ways to filter out common words in the preprocessing stage, mainly because I want to see how this affects the models (considering all of the data in our current dataset is programming based). I found the following packages and am in the process of implementing this.

The library I found for this was the Natural Language Toolkit (nltk), which has the following (potentially useful) functions:

Overall this package is pretty cool for more advanced preprocessing, but for now I am just planning on using the re (regular expression) package that is already in use. I am working on defining a dictionary of common CS words so I can implement this.
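
Here is a minimal sketch of the re-based filtering I have in mind; the word list is a small hypothetical subset, since the real dictionary of common CS words is still being defined:

```python
# Sketch of regex-based removal of common CS keywords during preprocessing;
# COMMON_CS_WORDS is a hypothetical subset of the real word list.
import re

COMMON_CS_WORDS = {"for", "while", "if", "else", "class", "return", "public",
                   "static", "void", "import", "def"}

def remove_common_words(text):
    """Strip common programming keywords and collapse leftover whitespace."""
    pattern = r"\b(" + "|".join(COMMON_CS_WORDS) + r")\b"
    cleaned = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()   # collapse the gaps left behind

print(remove_common_words("public static void main for while if class Example"))
```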

alejmedinajr commented 6 months ago

Current update: I spent time first thinking of common CS words that should not add any value to similarity. Some of this may be hard, since method names and class names will likely be the same across multiple solutions (meaning that if a solution is 10 lines and 30% of it is class and method declarations, the similarity will be inflated). At the same time, the text-length feature we are extracting should make up for some of this. Nonetheless, here are some of the words I defined as words to remove from CS assignments during the preprocessing phase.

Screenshot 2024-03-17 at 6 21 00 PM

After running models.py on the newly created dataset, it seems this increased the Decision Tree metrics, but the rest of the models are nearly identical in performance.

Screenshot 2024-03-17 at 6 20 42 PM

This is kind of disappointing to see, but it makes sense considering the only things that changed were the feature values in the dataset (and not by a lot). It is still interesting that the sequence-matcher methods (three of our features) have no distinct values: the first two are always 0% and the third is always 100%. Removing them does not really change anything, since useless features mostly neither help nor hurt a model's performance. I need to troubleshoot why this is happening, but that is likely a problem for the future (the focus should be on the dataset size and the model, not these features).

alejmedinajr commented 6 months ago

Current update: I am in the process of cleaning up the files I worked on today (removing old code, documenting new code), and I seem to have broken something in parsing.py. This must have happened either when I was refactoring the new function responsible for creating the dataset, or when I commented out or deleted old code that was not being used (or not working) before today.

alejmedinajr commented 6 months ago

I found the issue: I had removed an essential variable. I fixed it, added the documentation, and pushed the file. I think I am done working on this for today.

alejmedinajr commented 5 months ago

Starting on this again. Now that we have the user flow pretty much figured out, I can focus on calling and using the ML code with the submitted/uploaded content.

alejmedinajr commented 5 months ago

Update: I ended up actually testing the code and walking through the user flow. It seems some of the errors we show (like a user trying to sign up when their account already exists, or an invalid email being used) are not as clear as they could be. I also found that the button for switching between Gemini and ChatGPT does not work as intended (only ChatGPT is shown). I also spent some time thinking about how the "create your own model" feature would work.

alejmedinajr commented 5 months ago

Update: I was running into a problem, but it was because I was missing a directory for uploads. I changed the code in the API to check for this and create the folder if it does not exist, so no one should hit this error again if they forget about it. I also noticed (and verified) that the parsing does not make any connection to the React app. That makes sense, because we want to return the percentages and related results. I think I need to create a function in API.py that uses the ML and text-comparison code so it can be called from both the fileupload and formsubmission functions that are our FastAPI endpoints; otherwise we would have superfluous code.
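
Roughly what I have in mind (a sketch only; the endpoint and helper names here are illustrative, not necessarily the exact ones in API.py):

```python
# Sketch of the two fixes: ensure the uploads folder exists, and factor the
# comparison/ML call into one helper shared by both FastAPI endpoints.
import os
from fastapi import FastAPI, UploadFile

UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)   # create the folder if it is missing

app = FastAPI()

def analyze_text(text: str) -> dict:
    """Shared helper: run the comparisons/ML model and return the percentages."""
    return {"length": len(text)}          # placeholder for the real metrics

@app.post("/fileupload")
async def file_upload(file: UploadFile):
    contents = (await file.read()).decode("utf-8", errors="ignore")
    return analyze_text(contents)

@app.post("/formsubmission")
async def form_submission(text: str):
    return analyze_text(text)
```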

alejmedinajr commented 5 months ago

Update: I am still working on the function for "generating the report" the user sees after submitting an assignment. At some point, an additional file-upload area and text input should be added (so we can distinguish between instructions and student submissions). Doing so will make the product easier for users to understand and use (and easier for us to delineate on the backend). My current approach has the end goal of sending a dictionary that the React app can then render.

alejmedinajr commented 5 months ago

Current Update: I finished writing the method for generating a report in terms of just the cosine comparison, and I thought I could test it with plain text, but that proved harder than I expected. I think this is a sign to work on the front end and add the ability to upload the submission file/text as well, in order to test the method as-is. The good news is that once I do this, the other part should be easier, mainly because I just need to make function calls!
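
For context, a minimal sketch of a cosine-only report, assuming TF-IDF vectors and scikit-learn's cosine_similarity (the real method will fold in the other metrics too):

```python
# Sketch of a cosine-only report dictionary; assumes TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_report(submission: str, reference: str) -> dict:
    """Return a small report dictionary with the cosine similarity as a percentage."""
    vectors = TfidfVectorizer().fit_transform([submission, reference])
    score = cosine_similarity(vectors[0], vectors[1])[0][0]
    return {"cosine_comparison": round(score * 100, 2)}

print(cosine_report("Academic integrity matters because ...",
                    "The importance of academic integrity lies in ..."))
```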

alejmedinajr commented 5 months ago

Starting on this again.

alejmedinajr commented 5 months ago

Update: I made the changes so there is now a Generate Report button next to Submit Prompt. This is the first step in generating the report. It currently works without the machine learning model by simply dumping every metric into a dictionary (which is okay for now). The purpose of this is to let users avoid waiting on 100 iterations of API calls (unless they want to).

alejmedinajr commented 5 months ago

Update: The Generate Report button now redirects the user to the account page, where I plan to have a table of available reports (this does not work yet). I think the easiest way to grab and store the reports is the Firebase database, which might mean we should use that instead of the realtime database (at least for this portion, since it is connected to the user authentication).

alejmedinajr commented 5 months ago

Starting on this again

alejmedinajr commented 5 months ago

Update: I am implementing some more metrics for the model, mainly because I realized we could use more features, and I also found some new comparisons while reading for my AI course final project. I also cleaned up one of the functions, genearte_report, that I created last class period; it was superfluous and messy.

alejmedinajr commented 5 months ago

Update: I am about to test the new metrics I added. In the background, one of the pretrained models we will use (GloVe, I think that is the spelling) is downloading so it can be used (that line is currently commented out).
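
Assuming the gensim downloader is how the pretrained vectors get pulled in, the download and a basic similarity check look roughly like this (a sketch; the exact model name and similarity computation may differ in our code):

```python
# Sketch: download GloVe vectors via gensim and compare two short texts by
# cosine similarity of their averaged word embeddings.
import gensim.downloader as api
import numpy as np

glove = api.load("glove-wiki-gigaword-50")   # downloads on first use

def sentence_vector(text):
    """Average the GloVe vectors of the words the model knows about."""
    words = [w for w in text.lower().split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0)

a = sentence_vector("academic integrity is important for students")
b = sentence_vector("students should value academic honesty")
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print("embedding similarity:", round(cosine, 3))
```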

alejmedinajr commented 5 months ago

Update: While testing this, I ran into a few issues, specifically with some of the packages I am using. I am working on resolving them, but most of the metrics are being computed correctly (with some exceptions). After these are resolved, I can incorporate this into the model, which I anticipate having connected to the React app by the end of today.

alejmedinajr commented 5 months ago

Update: I tried resolving the issues with the new metrics, but I decided to move on since this can be revisited once the rest of the functionality is connected. In other news, I tested the metrics by taking an AI-generated response to the prompt "write an essay about the importance of academic integrity" and then removing several lines at random. After this, I ran it through our React app, and these were the results:

Screenshot 2024-04-14 at 1 39 40 PM

Note that I still need to multiply the newly added syntactic similarity by 100 so it is a percentage (and not a ratio, since the rest are percentages). I also still need to pass this data to the React app.

alejmedinajr commented 5 months ago

Update: I am reconstructing models.py; after working on the final AI project, there are some new tricks I want to incorporate that I think will benefit this project. Currently, I am working on a feature-extractor method, which did not exist before but could be useful.
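
A rough sketch of the feature-extractor idea (the exact feature set below is illustrative; the real version will use the fuzz/cosine metrics we already compute):

```python
# Sketch: collapse a (submission, reference) pair into one numeric feature
# vector the models can train on. Features shown are placeholders.
from difflib import SequenceMatcher

def extract_features(submission: str, reference: str) -> list:
    """Turn a pair of texts into the numeric features fed to the models."""
    matcher = SequenceMatcher(None, submission, reference)
    return [
        matcher.ratio() * 100,                  # sequence similarity as a percentage
        len(submission),                        # raw submission length
        abs(len(submission) - len(reference)),  # how different the lengths are
    ]

print(extract_features("def add(a, b): return a + b",
                       "def add(x, y):\n    return x + y"))
```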

alejmedinajr commented 5 months ago

Update: I think I am done reworking the models.py functions, and I am about to start testing them as well.

alejmedinajr commented 5 months ago

Update: I was testing the models and realized I forgot to include the metrics as data columns, so I will do that now. I also want to decrease the step size of the hyperparameter search, since these models fully train in milliseconds anyway.

alejmedinajr commented 5 months ago

Update: I just realized a huge flaw in the machine learning model. The metrics are based on comparisons against a similar (actually AI-generated) datapoint. This is a problem because the model is not trained on matching assignment types, so it would not make sense to feed these metrics into the model UNLESS the model is trained in the background on a larger set of AI-generated responses (somewhere between 100 and 1000) to the same assignment. That approach would catch not only AI-generated work but also plagiarism; however, it would take a long time and require a customized model per assignment (otherwise the metrics would be useless). I think this means the metrics should act as a supplement outside of the machine learning model, which is not a problem. It also means we need to gather more AI-generated work and build a large folder of AI-generated and human-generated work to train the model. I think it would be great to use the metrics as a standalone report in addition to the model (to either support the model or get closer to accurate results). This is not a major problem, since it is close to what was envisioned in the beginning.

alejmedinajr commented 5 months ago

Update: I am done working on this for now. I did some more testing on the model end, and I also looked for possible sources of training data, since we currently have only 40 datapoints (obviously not enough to train a reliable model). I also spent some time cleaning up models.py, including rearranging it and moving out functions that belong in other files. I will commit/push this code once it is better documented.

alejmedinajr commented 5 months ago

This works and is used by the React side too. The only thing that has not been implemented yet is automatically updating the training data, which is not a big deal (but would be nice to have). There are thumbs up/down icons (really buttons) that allow the user to label their own data. The feature for creating a custom model is a separate issue that should only be completed if time allows.