WIP: Experimental AI Integration in QualCoder

kaixxx commented 8 months ago

I’m happy to share a new experimental version of QualCoder with some AI-enhanced functionality: https://github.com/kaixxx/QualCoder/tree/ai_integration

If you want to see it in action, check out the video: https://www.youtube.com/watch?v=FrQyTOTJhCc

I would really like to see this incorporated in the regular version of QualCoder in the future. But for now, I've created a separate version so we can experiment without bothering the regular users of QualCoder. Since I'm using a separate config folder, both versions should be able to run alongside each other without any problems. Also, the database format hasn't changed. The only thing I do is add an additional vector database to any project opened with the AI-enhanced version. This is used for the semantic search. Since it resides in its own directory in the project folder, it should also not interfere with the regular QualCoder.

Thank you very much for this great piece of open-source software! I'm curious to know what the QualCoder community thinks about my additions.

All the best Kai

ccbogel commented 8 months ago

Hello Kai.

I have seen your video and will download and try your code. Thank you very much for this, it looks like a great addition to QC functionality. Your programming skills are excellent :) as you have really understood the QC code and implemented the GPT-4 integration, which is currently way beyond my skills. (Lots to read and try to understand.)

I do package the software as an executable for Windows and Ubuntu, using pyinstaller. Many end users prefer this as it is easy for them to use QC. (This is also why I package the icons and language files as base64 - this worked better for me than trying to package data files within pyinstaller. I don't use the spec file; I think it was stored on the main code page for historical reasons - but I see in your fork this has been modified for pyinstaller use - so another thing for me to try out.) My concern is whether it would still package up nicely in this way for those users who cannot use the command line installation methods. I presume yes, but need to test.

And I'm glad you (and others) like using QualCoder.

with regards Colin

AndrzejWawa commented 8 months ago

When I tried, I had such an error:

[Screenshot of the error message]

amru39 commented 8 months ago

When I tried, I had such an error:

Ran into the same issue. Getting the same error for both free search and using defined codes

ccbogel commented 8 months ago

I think it might be best to release a 3.5 version without AI very soon. So that the more recent features can be out there 'in the wild'. Then incorporate the AI in the subsequent 3.6.

kaixxx commented 8 months ago

@AndrzejWawa @amru39 Ah, it seems that you don't have access to GPT-4. If we follow the link provided (https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4), we find the following disclaimer: "If you're a Pay-As-You-Go customer and you've made a successful payment of $1 or more, you'll be able to access the GPT-4 API"

Really annoying, sorry for that. Why does OpenAI give new users free credits if they can't use them on their state of the art model? So it seems that you must at least once pay for credits. From what I see, $5 (USD) is the minimum amount. Again, sorry for that, there is nothing I can do about it right now.

If you try, please report back if this fixes the problem, thanks.

kaixxx commented 8 months ago

I think it might be best to release a 3.5 version without AI very soon.

Yes, I would consider my AI functionality an experimental feature at the moment. Let's wait with the integration in the main version until it improves a bit. This is also why I tried to implement it in a way that you can use both versions alongside each other. I'll try to keep up with the changes you make and rebase my version on QualCoder 3.5 once you release it. Do you plan to make any changes to the database structure? This would break the compatibility.

I do package the software as an executable for Windows and Ubuntu, using pyinstaller.

Yes, it would be very nice to add binaries. I've created a binary + installer for Windows. My Linux skills are not good enough to do the same for Ubuntu. Having macOS binaries would also be very nice. My updated spec-file should also work on Linux/macOS (untested). I had to edit the spec file because there are no pyinstaller "hooks" for some of the very new ai-frameworks I use. Therefore, you have to tell pyinstaller about hidden imports and such through the spec file or via the command line. Other than that, I've tried to stick to your approach and even included my animated search icon as base64 ;)
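A minimal sketch of the command-line route mentioned above, for readers who have not worked with hidden imports before. It only assembles a pyinstaller call with one `--hidden-import` option per module. The module names and the entry script path are hypothetical placeholders, not the actual list used in the fork.

```python
import shlex


def build_pyinstaller_command(entry_script, hidden_imports, app_name="QualCoder"):
    """Return the argument list for a PyInstaller build with explicit hidden imports."""
    cmd = ["pyinstaller", "--name", app_name]
    for module in hidden_imports:
        # --hidden-import names a module that PyInstaller cannot detect by itself
        cmd += ["--hidden-import", module]
    cmd.append(entry_script)
    return cmd


# Hypothetical module names and script path, for illustration only
command = build_pyinstaller_command(
    "qualcoder/__main__.py",
    ["some_ai_framework", "some_ai_framework.backends"],
)
print(shlex.join(command))
```

The same module names could instead be listed in the `hiddenimports` argument of the spec file, which is easier to keep under version control.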

Thank you Colin for providing and maintaining this great open source project with so much effort. It's really a shame that we have so little open source stuff in the field of qualitative data analysis. Especially with AI, it seems absolutely crucial for me that we can look at the code, see the prompts that are sent to OpenAI or other AI providers and even alter them to fit our needs. It's about transparency and having control over the methodological decisions the software makes on our behalf. (Sorry for the rant...)

kaixxx commented 8 months ago

I've also created an issue on my fork about the "InvalidRequestError": https://github.com/kaixxx/QualCoder/issues/1 (Sorry for not having the issue-feature enabled on my fork. I didn't realize that this is the default setting for forks on GitHub.)

amru39 commented 8 months ago

It started working once I made the payment (USD 5). Will try using AI search for a bunch of articles I am reviewing to see how it works and how many credits it uses.

ccbogel commented 8 months ago

OK. Well I think the plan for now is, I will try and update the translation files - EDIT - I will not update translation files. I am finding on Windows the two translation apps just do not seem to work - xgettext and Qt Linguist. Then release 3.5 with Windows and Ubuntu binaries very soon. I don't have a Mac to do a binary for that. Another contributor did provide a binary some time ago, but there must be differences between Macs (e.g. Intel or RISC based), as it only worked for some and not others. @kaixxx There are no planned changes for the database. I think the latest version works pretty well now. Over the years changes had to be made, as understanding of what data needed to be stored was updated. You may have observed in the __main__.py MainWindow.open_project method, the modifications made to older database structures on opening them.
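For readers unfamiliar with the idea of upgrading an older database when it is opened, here is a minimal sketch of the general pattern with SQLite. It is not the code from MainWindow.open_project; the table and column names are made up for illustration.

```python
import sqlite3


def ensure_column(conn, table, column, definition):
    """Add a column to an existing table if an older database lacks it."""
    # PRAGMA table_info returns one row per column; index 1 holds the column name
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
        conn.commit()
        return True
    return False


# Demonstration with an in-memory database and hypothetical names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER PRIMARY KEY, name TEXT)")
first = ensure_column(conn, "example_table", "memo", "TEXT")   # column is added
second = ensure_column(conn, "example_table", "memo", "TEXT")  # nothing left to do
```

Because the check runs every time a project is opened, old and new project files can both be opened by the same program version.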

AndrzejWawa commented 8 months ago

I tested the AI and I think it works quite well. A couple of searches cost me about $0.40, which is quite OK, I think. In general, for me the functionality is ready to be included in the formal release. But, of course, there could be further improvements, which I suggested in the issues for the QC AI project.

ccbogel commented 8 months ago

Yes it works quite well for me also. I did a manual install on Windows 11. One issue that arose was the need to have long file paths enabled. So Windows users doing a manual install need to run regedit, find the LongPathsEnabled entry and change the 0 to a 1.

[Screenshots: enabling long paths in regedit]
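A small sketch for checking the setting without opening regedit. It reads the LongPathsEnabled value from the standard FileSystem key under HKEY_LOCAL_MACHINE and returns None on systems other than Windows. It only reads the value; changing it still requires administrator rights.

```python
import sys


def long_paths_enabled():
    """Return True/False on Windows, or None if the setting does not apply or cannot be read."""
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows

    key_path = r"SYSTEM\CurrentControlSet\Control\FileSystem"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "LongPathsEnabled")
    except OSError:
        # Key or value is missing, or access was denied
        return None
    return value == 1


status = long_paths_enabled()
print("LongPathsEnabled:", status)
```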

Two other comments:

The AI is presented as human 'I', 'we' - not keen on this but OK .
When starting to code text the AI begins the process of 'memorizing the documents' - maybe a bit more information here for the end user about - 'This takes time' and what exactly it is doing during this process.

kaixxx commented 7 months ago

Sorry for being so quiet over the last couple of weeks. I had a little winter break, but was also working on things in the background:

I’ve arranged a few workshops in Germany and Switzerland over the next couple of months to introduce QualCoder and its AI-functionality to more people. One was last week already; the feedback was very positive.
Together with a colleague I’ve created a binary installer for QualCoder on macOS (x86 and Apple Silicon). It’s working fine.
I’m in contact with a scientific supercomputing facility here in Germany, discussing if we could switch to open-source AI models hosted on their servers (instead of GPT-4). I’m thinking about mixtral 8x7b or the upcoming Llama 3. These discussions are in a very early stage, we will see. Imho, it would be nice to use open-source models with QualCoder, what do you think? They would be free to use, and the whole privacy situation regarding the data would be much better. This is actually a big concern for people I talked to here in Europe. They don’t trust OpenAI to not use their data for the training of future AI models, which I can kind of understand given the track record of this company when it comes to copyright, etc.

Other than that, I was always ready to fix bugs and problems with the first beta version. But nothing serious popped up which really surprised me. It shows how robust QualCoder is as a platform (and maybe also that I’ve learned from the early mistakes in my other AI-based software-project, noScribe).

For the future I have several ideas how to develop this further:

I would like to make the current AI-search function more targeted: The AI could ‘learn’ from the already coded pieces of data for a given code and focus the search in certain directions.
Already in the code is a new window where you can chat with the AI about your data, ask questions etc. I have disabled it for now since it’s not finished yet, but I would like to continue working on that. I think it’s a nice way to get some first insights in your data before you really dig in (ATLAS.ti has something similar).

What do you think, should we focus more on integrating the ai-based functions as they are right now (with some minor improvements) in a next release of QualCoder? Or should we keep the ai-version separate for a little longer so that it can become more advanced and mature? I’m open for both directions.

kaixxx commented 7 months ago

Some quick comments to your suggestions:

need to have long file paths enabled.

In my case, "LongPathsEnabled" was already enabled. But even when I disabled it, QualCoder was working fine (thinking about it now: maybe I need to reboot). Do you remember which files caused the problems?

The AI is presented as human 'I', 'we' - not keen on this but OK .

I will change that, you are right. In an early stage of the development, I was thinking about implementing several ai-powered agents to chat with, maybe even giving them names... But I moved away from this idea for now.

When starting to code text the AI begins the process of 'memorizing the documents' - maybe a bit more information here for the end user about - 'This takes time' and what exactly it is doing during this process.

Yeah, I will try to explain this a little better in the status messages.

ccbogel commented 6 months ago

Hi there, yes, I think all of your ideas are good.

Yes an open source AI model would be better.

How does the AI work with the code name/memo? - I guess it is looking for word similarities. So in that case more end user instructions on filling out good code names/memos would be beneficial. From my perspective - with developing over the years - lots of details and user instructions are really important.

Chat with AI - yes I guess this could be a good feature.

Regarding the AI and chat: Is this English-language focused, or are there options to have it used in multiple languages? Just curious really.

Another AI related function - an idea only - might be to analyse images?

Yes if you feel the AI is good and the feedback has been positive, integration should proceed.

I am also wondering from the feedback from others - how best to further develop QualCoder: Other really useful functions to add, as it seems the big expensive proprietary brands do better (e.g. Nvivo, Atlas).

AND importantly, I feel, would it be better to have a bigger group working on QualCoder.
I have lots of limitations in my skills. You have your skills, etc ... I could add you to this github repository as a collaborator. And/or would it be better to have it moved or controlled by a bigger group, e.g. a university or research group so that it could be maintained and developed over time?

kaixxx commented 6 months ago

Hi Colin, great questions and ideas. Let me elaborate a little more on the inner workings of the AI search as of right now.

How does the AI work with the code name/memo? - I guess it is looking for word similarities.

The AI is looking for semantic similarities on the level of sentences. The process consists of three major steps:

1) "Memorizing" the semantic content of a document:

This happens every time you add a new document to your project.
The text is split on sentence borders into chunks of around 300-500 characters. A local AI model ("sentence encoder") then translates the semantic meaning of these chunks into a mathematical representation. You can imagine this as a vector in a three-dimensional space: If the semantic meaning of two sentences is very close, the corresponding vectors will also point to nearly the same position in space. If the semantics are different, the vectors will point in different directions. To make this vector representation more precise and nuanced, the mathematical space I use has actually not three, but 1024 dimensions. I use the following AI-model, which is multilingual and supports around 100 different languages (more or less): https://huggingface.co/intfloat/multilingual-e5-large (Since semantics are not directly language dependent, a semantic search can also work across languages. I.e., you can search with an English code name in a dataset containing Italian and Chinese documents. Quite fascinating actually.)
This vector representation of the semantic meaning is then stored in a special database called a "vector store" in the project folder for later use.
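Step 1 can be illustrated with a minimal sketch using only the standard library. The chunking function groups whole sentences up to a size limit, and the cosine similarity is one common way to measure how closely two vectors point in the same direction. The tiny three-dimensional vectors are made up; real embeddings would come from the sentence encoder and have 1024 dimensions. The function names are hypothetical and not taken from the QualCoder source.

```python
import math
import re


def split_into_chunks(text, max_chars=500):
    """Group whole sentences into chunks of at most about max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars still becomes its own chunk
    return chunks


def cosine_similarity(a, b):
    """Close to 1.0 for vectors pointing the same way, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


chunks = split_into_chunks("One. Two. Three.", max_chars=10)

# Toy "embeddings": the first two are close in meaning, the third is unrelated
vec_a, vec_b, vec_c = [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]
similar = cosine_similarity(vec_a, vec_b)
different = cosine_similarity(vec_a, vec_c)
```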

2) AI-based search:

The basic principle is to search for semantic similarities between the code-name and the chunks of data in the vector store. In practice, the process is a little more involved:

Using only the code name for the search leads to poor results. Comparing the semantic meaning of a single word (or a very short notion) to a whole sentence in the data doesn’t work well. This is especially true if we use less descriptive and more conceptual or theoretical notions to name our codes. To overcome these problems, we have to expand the semantic space of the code name and make it more descriptive before searching for similarities in the empirical data stored in the vector store.
To do that, I ask GPT-4 to create a list of ten sentences describing the code in simple terms. (If the user chooses to also send the memo to GPT-4, the AI will use this to understand the code better.) These ten sentences are then used to search for semantically similar pieces of data in the vector store.
As a result, I get ten different lists of chunks of empirical data which might be related to the given code. I consolidate these ten lists into one large master list, ranking data that appears in more than one of the lists higher in the consolidated master list (assuming that these pieces of data are more relevant for the given code).
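The consolidation at the end of step 2 could look roughly like the sketch below. The exact scoring is an assumption: chunks are ordered by the number of lists they appear in, with ties broken by the best position they reached in any list. The chunk ids are placeholders.

```python
from collections import defaultdict


def consolidate(result_lists):
    """Merge several ranked lists of chunk ids into one master list."""
    hits = defaultdict(int)  # number of lists in which a chunk appears
    best_rank = {}           # best (lowest) position seen for a chunk
    for results in result_lists:
        # dict.fromkeys removes duplicates within one list but keeps the order
        for rank, chunk_id in enumerate(dict.fromkeys(results)):
            hits[chunk_id] += 1
            best_rank[chunk_id] = min(rank, best_rank.get(chunk_id, rank))
    return sorted(hits, key=lambda c: (-hits[c], best_rank[c]))


# Three result lists instead of ten, to keep the example short
lists = [["c1", "c2", "c3"], ["c2", "c4"], ["c2", "c1"]]
master_list = consolidate(lists)
print(master_list)
```

Here "c2" comes first because it shows up in all three lists, followed by "c1", which appears in two.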

3) Refining the results with GPT-4:

The result of step 2 will be a long list that still contains many pieces of data that are only marginally related to the code. To narrow the results down, I send the top 12 entries of the list to GPT-4 with a prompt to

sort out irrelevant entries,
rerank the entries according to their relevance,
explain the reasoning why the entry is relevant (this shows up in the UI as a tooltip “the AI thinks”),
select a shorter quote.

The results are then shown in the UI. If you click on “find more” at the end of the list, the next 12 pieces of data are sent to GPT-4.
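The batching in step 3 can be sketched as follows. The prompt wording is purely illustrative and not the prompt QualCoder actually sends; only the batch size of 12 and the four tasks come from the description above.

```python
def batches(ranked_chunks, batch_size=12):
    """Yield consecutive batches from the ranked master list."""
    for start in range(0, len(ranked_chunks), batch_size):
        yield ranked_chunks[start:start + batch_size]


def build_refinement_prompt(code_name, batch):
    """Assemble a hypothetical prompt asking the model to filter and rerank one batch."""
    tasks = [
        "sort out irrelevant entries",
        "rerank the entries according to their relevance",
        "explain why each remaining entry is relevant",
        "select a shorter quote from each remaining entry",
    ]
    lines = [f'Code: "{code_name}"', "Tasks:"]
    lines += [f"- {task}" for task in tasks]
    lines.append("Entries:")
    lines += [f"{i}. {text}" for i, text in enumerate(batch, start=1)]
    return "\n".join(lines)


master_list = [f"chunk-{i}" for i in range(30)]
pager = batches(master_list)
first_batch = next(pager)   # sent after the initial search
second_batch = next(pager)  # sent when the user clicks "find more"
prompt = build_refinement_prompt("example code", first_batch)
```

Using a generator means the next batch is only prepared when it is requested, so no API credits are spent on entries the user never asks for.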

lots of details and user instructions are really important

Is the Wiki here on GitHub still the main user manual for QualCoder? I think I'm going to add two pages: One where I describe the background of the AI search (basically like above but with some additional methodological notes) and one where I go more into the practical side of coding with AI. What do you think?

AndrzejWawa commented 6 months ago

I think that integration with the next release will be more convenient from the users' perspective than keeping QCAI separate. As to data privacy – maybe there should be a warning window for the use of AI, that data should be anonymised before AI analysis and that it will be sent to a third party? Do you think that there is a risk of leakage?

kaixxx commented 6 months ago

Some thoughts on the future development of QualCoder:

would it be better to have a bigger group working on QualCoder.

Yes, definitely. This project is quite big for a single person to maintain. (I don’t know how many other people are involved right now.) But from my experience it is very difficult to find people that are both experienced in qualitative social research AND in programming.

It would be very good to have a couple of people who would volunteer to take responsibility for certain modules of the project. I could do that for the AI-integration (answering questions and bug reports related to this topic, plan next steps, keep the libraries up to date, etc.). Somebody else could be responsible for the macOS-version (testing, compiling, updating the manual…), etc. But as I said, it is not easy to find these people, I guess.

Other really useful functions to add, as it seems the big expensive proprietary brands do better (e.g. Nvivo, Atlas).

QualCoder has a lot of functionality, and you always add to it. I don’t think that’s a problem. When it comes to AI in particular, the functionality in the commercial software packages is often quite underwhelming, especially compared with the huge marketing promises they make. Look for instance at this critical assessment of the AI-based functions in ATLAS.ti: https://youtu.be/QwMe6akHhvY

If you want to achieve a wider adoption of QualCoder, I would suggest focusing on two key points:

1) A more intuitive user interface and easy access in general. As already mentioned, many people in qualitative research are not very technically inclined IMHO. This is also true for students. They need a nice installer (Windows/macOS) and a UI where they feel at home from the very beginning. Commercial software packages seem much more refined in this regard, I would say, because they can invest more resources here. (I have some small ideas for the UI and might suggest something, we will see.)

2) Collaboration functions for research groups working together on the same project. You have the topic on the first page of your wiki so I’m sure you are aware of this. (I also know that this is not easy to implement without risking data corruption…)

And/or would it be better to have it moved or controlled by a bigger group, e.g. a university or research group

I’m not sure. I’ve seen a lot of software projects where people get funded for one or two years, develop a prototype, publish a paper, and abandon the project shortly after that. The reason why QualCoder survived and continues to flourish is that there is a person behind it – you – who really identifies with the project and keeps it going no matter what. But I understand that this is also a burden. As I already said: It would be great to find more people who could contribute on a regular basis and would be responsible for certain tasks.

kaixxx commented 6 months ago

I think that integration with the next release will be more convenient from the users' perspective than keeping QCAI separate.

Good point!

As to data privacy – maybe there should be a warning window for the use of AI, that data should be anonymised before AI analysis and that it will be sent to a third party? Do you think that there is a risk of leakage?

No, I don't think that there is a risk of leakage, at least not in these particular functions that I use. As I explained in the video, any rumour in this direction would hurt the business model of OpenAI very much. But I can understand why people are generally a little suspicious when it comes to OpenAI and data protection.

Anonymising a whole interview is basically impossible IMHO, especially if you are working against a large AI-model that is very good at de-anonymizing text...

ccbogel commented 6 months ago

These are not in my skill set: 1) A more intuitive user interface and easy access in general. As already mentioned, many people in qualitative research are not very technically inclined IMHO. This is also true for students. They need a nice installer (Windows/macOS) and a UI where they feel at home from the very beginning.

For 2 Collaboration functions for research groups working together on the same project.

This could be possible if a database such as MySQL, MariaDB or similar was used - which could be accessed at the same time across the internet. However, the downsides are: It would be a lot, lot harder to install than using the SQLite database that is currently used. A lot of testing would be needed to ensure that functions used by different people at the same time did not clash, and that the auto updating of the codes tree etc. occurred. I feel this could be beyond my skills. One thing I do like about using the SQLite database is that it is easy to zip the project and unzip elsewhere for anyone to use.

Yes - the ongoing updating and responding to issues is becoming more difficult or burdensome.

ccbogel commented 6 months ago

@kaixxx OK, I have added you as a collaborator on the project. You will get a request from GitHub.

kaixxx commented 6 months ago

A quick update, I have good news: It seems that we can use the open AI model Mixtral 8x7b on a server run by the Helmholtz Association in Germany. The project is called “Blablador”: https://helmholtz-blablador.fz-juelich.de/, maintained by Alexandre Strube (they also have an API). Helmholtz is a large association of German research centers in the technical/biological/medical field. They run a super computing facility in Juelich, Germany, where this server is also located. The project is a little experimental, but hopefully stable enough for us to use.

The model Mixtral 8x7b was trained by the French company Mistral AI: https://mistral.ai/news/mixtral-of-experts/. It is considered an “open-weight model”. We don’t get the training data, but the model itself is freely available under the Apache 2.0 license. The performance is on par or slightly better than GPT-3.5, not quite on the level of GPT-4. But from my initial tests it seems good enough for our purpose. The model handles English, French, Italian, German and Spanish. My idea is to keep GPT-4 as a second option, mainly because it supports more languages.

I see several advantages using Mixtral:

Data privacy: The server is GDPR-compliant, located in Germany and doesn’t store any of the data we send (except an IP address and login cookie). I hope this will convince more people that their data is safe when they use the AI functionality of QualCoder.
Mixtral 8x7b is allegedly not “aligned” (or “censored”, as some people call it) in any way, unlike the OpenAI-models. So, no matter how nasty the topic of our research is, the model should not refuse to collaborate. (Although I never had a real issue here with GPT-4 either. Maybe I should do more nasty research...)
Accessibility: People can use the AI model free of charge. They only need to obtain an API-key from Helmholtz. I think this is especially helpful for students. You can log in with your account from basically every major academic institution in the world (as far as I see), or use an ORCID, Google or GitHub account.

I have limited time right now because the new semester starts next week. But my plan is to try out the new AI model and work on integrating it in QualCoder over the next couple of weeks. This would then also be a good moment to add the AI functionality to the main version of QualCoder, I would suggest.

@ccbogel: Thank you for adding me as a collaborator!

kaixxx commented 6 months ago

(Sorry, closed this by accident.)

MicRaving commented 5 months ago

@kaixxx Are you considering adding the collaboration feature? This would be a killer feature for university researchers. We're trying to use QualCoder on a project with five coders and the current version of QualCoder is not really practical.

kaixxx commented 5 months ago

Are you considering adding the collaboration feature?

@MicRaving: Let's continue the discussion about collaboration features here: https://github.com/ccbogel/QualCoder/discussions/894

surak commented 5 months ago

A quick update, I have good news: It seems that we can use the open AI model Mixtral 8x7b on a server run by the Helmholtz Association in Germany. The project is called “Blablador”: https://helmholtz-blablador.fz-juelich.de/, maintained by Alexandre Strube (they also have an API). Helmholtz is a large association of German research centers in the technical/biological/medical field. They run a super computing facility in Juelich, Germany, where this server is also located. The project is a little experimental, but hopefully stable enough for us to use.

That's me! :-)

The model Mixtral 8x7b was trained by the French company Mistral AI: https://mistral.ai/news/mixtral-of-experts/. It is considered an “open-weight model”. We don’t get the training data, but the model itself is freely available under the Apache 2.0 license. The performance is on par or slightly better than GPT-3.5, not quite on the level of GPT-4. But from my initial tests it seems good enough for our purpose. The model handles English, French, Italian, German and Spanish. My idea is to keep GPT-4 as a second option, mainly because it supports more languages.

Keep in mind that the number of GPUs I have is limited so far, so if newer, more interesting models appear, I remove the old ones. But to cope with code which I don't want to see broken just because of some change of mind, I made aliases which I intend to keep.

For example, as of today, Mistral 7B v0.2 is aliased as alias-fast, and Mixtral 8x7b is alias-large. I also have alias-code, for, well, code models. There is alias-embeddings, but that's broken for the moment. Well, you get the idea.

You can always query the models with a

curl -X 'GET' \
  'https://helmholtz-blablador.fz-juelich.de:8000/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer glpat-MYTOKEN'
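The same query can be sketched in Python with the standard library only. The URL and headers are taken from the curl call above. The shape of the response (a "data" list whose entries carry an "id") is an assumption based on the OpenAI-style path and should be checked against the real output.

```python
import json
import urllib.request

BASE_URL = "https://helmholtz-blablador.fz-juelich.de:8000/v1"


def build_models_request(token):
    """Prepare a GET request for the model list; nothing is sent yet."""
    return urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"accept": "application/json", "Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_model_ids(token):
    """Fetch the model list and return the ids (needs network access and a valid token)."""
    with urllib.request.urlopen(build_models_request(token), timeout=30) as response:
        payload = json.load(response)
    # Assumed response shape: {"data": [{"id": "..."}, ...]}
    return [model["id"] for model in payload.get("data", [])]


request = build_models_request("glpat-MYTOKEN")  # placeholder token
print(request.get_method(), request.full_url)
```

Calling `list_model_ids` with a real token would show whether aliases such as alias-large are currently available.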

It's nice that you find it useful, and hope this helps with stability!

kaixxx commented 5 months ago

That's me! :-)

@surak: Hi Alexandre, nice to see you here. I'm in the process of implementing the connection to blablador. I'm using alias-large already, not referencing mixtral directly. But the problem is that the prompts (which are embedded in the source code) are carefully tweaked to work with a particular model and may break if you change to a different architecture (like Llama). It would be very helpful if you could at least announce such changes via the blablador mailing list so that I can test and tweak things if needed. Thank you for providing and maintaining this nice infrastructure!

surak commented 5 months ago

Sure, as soon as there's something better (was thinking about jamba and dbrx, but not yet), I will write on the mailing list right away!

ccbogel / QualCoder

WIP: Experimental AI Integration in QualCoder #875