Closed: alejmedinajr closed this issue 8 months ago
Starting on this now
Current update: I have finished looking into different possibilities for this, and I think I will focus on the following: a standard vim-style difference, an interesting way of comparing text that I will include a picture of in a later comment, and possibly some sort of machine learning technique. At this moment, I have implemented what I believe is a vim-style difference using the difflib package. I will test it and then work on the second approach if successful.
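For reference, a minimal sketch of what the difflib-based "vim difference" could look like (the function name and file handling here are placeholders, not the actual implementation):

```python
# Minimal sketch of a vim-diff-style comparison using difflib.
# diff_files is a hypothetical helper, not the committed code.
import difflib

def diff_files(path_a: str, path_b: str) -> str:
    """Return a unified diff of two text files, similar to vimdiff output."""
    with open(path_a, encoding="utf-8") as file_a:
        lines_a = file_a.readlines()
    with open(path_b, encoding="utf-8") as file_b:
        lines_b = file_b.readlines()
    return "".join(difflib.unified_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b))
```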
Current update: I tested the sequence comparison, and according to it, two of my personal statements match ~46%, which I do not think is entirely accurate. I then tried a cosine comparison, but this performed even worse, reporting only a ~6% content match, which I know is not accurate. I am going to keep experimenting with the cosine comparison because I think it will be more beneficial than the built-in sequence comparison.
Current update: I kept experimenting with the sequence comparison, and I have modified it to the point where it now finds an ~84% match between two of my personal statements. This is definitely more accurate than before. I started by modifying the code to find alternative ways of computing the comparison, which did not work as well. I then looked further into the documentation and found alternative ratio functions that can be used with the matcher object. Of the three, the one currently in use returns an upper bound on the original 2*M/T ratio value. Another of the functions computes a value of ~67%, which is less accurate in my opinion. I thought about averaging the values, but this would just be a value closer to 65%. This will definitely have to be investigated further when we have more data; for now, this is fine.
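To document what those three ratio functions are, here is a small sketch (assuming the SequenceMatcher is used directly on the two raw strings, which may not exactly match the current code):

```python
# Sketch of the three similarity functions on difflib.SequenceMatcher.
# ratio() is 2*M/T (M = matching characters, T = total characters in both
# strings); quick_ratio() and real_quick_ratio() are faster upper bounds.
import difflib

def sequence_similarities(text_a: str, text_b: str) -> dict:
    matcher = difflib.SequenceMatcher(None, text_a, text_b)
    return {
        "ratio": matcher.ratio(),                        # exact value, slowest
        "quick_ratio": matcher.quick_ratio(),            # upper bound on ratio()
        "real_quick_ratio": matcher.real_quick_ratio(),  # cheapest upper bound
    }
```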
Current update: I have implemented three different string comparison functions. They all show different performance on the same two files being compared (which have a high degree of similarity since they are mostly the same). The sequence comparison similarity percentage is 84, the similarity using the max value produced by fuzzywuzzy's string comparisons is 89, and the cosine similarity using vectorization and scikit-learn is 93. It is important to note that all of these percentages are rounded to the nearest integer since this is likely how they will be presented to the user.
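The three functions roughly follow this shape (a hedged sketch; the actual names and details in parsing.py may differ):

```python
# Sketch of the three comparison strategies; helper names are illustrative.
import difflib
from fuzzywuzzy import fuzz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sequence_similarity(a: str, b: str) -> int:
    # difflib sequence comparison, rounded to a whole percentage
    return round(difflib.SequenceMatcher(None, a, b).quick_ratio() * 100)

def fuzzy_similarity(a: str, b: str) -> int:
    # take the maximum of several fuzzywuzzy ratios (already 0-100 integers)
    return max(fuzz.ratio(a, b), fuzz.partial_ratio(a, b),
               fuzz.token_sort_ratio(a, b), fuzz.token_set_ratio(a, b))

def cosine_similarity_percent(a: str, b: str) -> int:
    # vectorize both documents with TF-IDF, then compare with cosine similarity
    vectors = TfidfVectorizer().fit_transform([a, b])
    return round(cosine_similarity(vectors[0:1], vectors[1:2])[0][0] * 100)
```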
Current update: I think the next phase for this would be to build a machine learning model that uses logistic regression to determine the percentage. This would require a dataset with labeled data (each sample marked as either plagiarized or not plagiarized), which would take some work. I do think the current text comparison features we have seem okay, and the cosine comparison already uses existing machine learning tools (from scikit-learn). Despite this, I think a model trained on labeled data would have better (or at least comparable) performance. Moreover, the comparison functions have only been tested with one example, so the dataset would need to be made anyway in order to test the accuracy of all comparison methods.
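As a rough idea of what that could look like once a labeled dataset exists (the CSV file name and column names below are made up for illustration):

```python
# Sketch of training a logistic regression classifier on labeled data.
# "labeled_submissions.csv", "text", and "plagiarized" are assumed names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("labeled_submissions.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["plagiarized"], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# predict_proba gives a percentage-like score for the "plagiarized" class,
# and score() gives accuracy on the held-out test split
print(model.predict_proba(X_test)[:, 1])
print("accuracy:", model.score(X_test, y_test))
```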
This is likely what I will look into next time I work on this problem: source
Starting work on this now
Current update: I added full Python documentation (docstrings) to the file parsing.py, which should be helpful for understanding the different ways of comparing text and why they are all there.
Current update: I read about more ways to compare text, and I think we already have an appropriate number of approaches. I also did some research to get a better idea of how the ML model will work for this. Here are some of the packages I looked into that should be helpful (a small classification sketch follows the list):
scikit-learn: machine learning library in Python that provides various tools for classification, including support vector machines (SVM), decision trees, and random forests.
Natural Language Toolkit (NLTK): library for natural language processing tasks (e.g. tokenization, stemming, tagging, parsing, and classification). Offers tools for text comparison and feature extraction.
spaCy: NLP library in Python, provides efficient tokenization, part-of-speech tagging, named entity recognition, and dependency parsing. It can be used for text comparison and feature extraction as well.
TensorFlow with Keras: TensorFlow is an open-source machine learning framework developed by Google, and we can use Keras (a high-level neural networks API that runs easily on TensorFlow). From what I read, this would be more for deep learning.
PyTorch: Another deep learning framework that offers dynamic computational graphs and is widely used for natural language processing tasks. It provides tools for building and training neural networks for text classification.
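To make the scikit-learn option from the list concrete, here is a toy sketch of an SVM text classifier (the texts and labels are placeholders, not real data):

```python
# Toy sketch of the scikit-learn SVM route: TF-IDF features + LinearSVC.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder labeled examples: 1 = matches AI tool output, 0 = original
texts = ["sample submission text", "another original essay",
         "output copied from an ai tool", "text produced by a chatbot"]
labels = [0, 0, 1, 1]

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(texts, labels)
print(classifier.predict(["a new submission to score"]))
```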
This might be another decision that needs to be discussed as a team.
Current update: I spent time going through my branches and deleting branches that have already been merged into main (to avoid confusion and also to make sure I do not go back to those branches, since they are definitely behind the main branch). Should those branches need to be remade, this should not be a problem (and the newly created ones will contain the up-to-date code from main). I also started writing a condensed testing guide doc that is in the shared folder. We are getting closer to the stage where testing can start on some areas, so this is important to have going forward. I also made a pull request for the code in this issue thread.
Current update: I moved on to the testing document in the Google Drive and started writing it. I got up to the frontend testing procedure, with stems of things to test. I will now start on the backend testing procedure. This document relies on the information from the 5th writing assignment, but it is in bullet-point format with more specifics so that it is easier to read and follow.
Current update: I finished up the testing document, and it now reflects how we will test along with some specifics that need to be tested. This should be useful for finding even more things to test while testing in the moment. It also delineates the different testing duties in a document that is not paragraphs of 11-point single-spaced text. I also spent time looking at the new code added to the React app (I have not touched it for a couple of weeks and it has changed quite a bit).
Current update: I talked on the phone with Travis for ~5 minutes to clarify some of the CSV format information related to the testing data. Once we have the testing files made, I can test the text comparison while some team members begin working on the ML model. I also spent time looking at reasons why similar tools do not work well. It seems that even OpenAI shut down its own AI text classifier because of its poor accuracy.
Sources read:
https://arstechnica.com/information-technology/2023/07/why-ai-detectors-think-the-us-constitution-was-written-by-ai/
https://www.washingtonpost.com/technology/2023/04/01/chatgpt-cheating-detection-turnitin/
https://decrypt.co/149826/openai-quietly-shutters-its-ai-detection-tool
I mainly wanted to see whether what we are doing still makes sense, and it does, since we are including the actual representative outputs. This is not something I have seen from other similar products.
This ticket is for looking into, and then implementing, possible ways to compare the submission input to the AI tool outputs. There are existing tools that do this, so those would be a good starting point. These existing tools have pretty poor accuracy, though, so we need to be creative in how we do this.