Open antmarakis opened 6 years ago
Hello @MrDupin! I have worked in NLP problems and I can give you some help here. :)
That's great! You can work on them either on GSoC or on your own. Whenever you want to get started, don't forget to post a comment here explaining in short what you want to do.
@MrDupin With reference to point 3, I would like to include support for both N-gram and Bag of Words models. I am mentioning the same in my proposal.
@ashwinnalwade Awesome! Good luck with your application!
Hello. With reference to the 2nd point: we can do some preprocessing tasks, like ranking the words/sentences in each paper by their frequency of occurrence and by the uniqueness of each word, using a ranking factor like the length of the word/sentence.
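A minimal sketch of the frequency-ranking idea above, using only the standard library. The helper names `tokenize` and `top_words` are hypothetical and are not taken from the notebook; the tokenizer is deliberately crude.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only.

    Digits and punctuation are dropped. Note that contractions such as
    "don't" are split into two tokens; this is a simplifying assumption.
    """
    return re.findall(r"[a-z]+", text.lower())


def top_words(text, k=10, stopwords=()):
    """Return the k most frequent words as (word, count) pairs."""
    counts = Counter(w for w in tokenize(text) if w not in stopwords)
    return counts.most_common(k)
```

Calling `top_words` once per author, with and without a stopword list, is one way to check whether removing the most popular words changes the ranking.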
Hello, I would like to work on the first point, implementing log addition, if no one is doing it yet. We can retain the decimal library Naive Bayes implementation and add a logarithmic version alongside it. It might be good to compare both approaches in the notebook.
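The log-addition idea can be sketched roughly as follows. This is not the notebook's implementation: the function names are hypothetical, the word model is a plain unigram count, and add-one smoothing is an assumption made here so that unseen words do not produce `log(0)`.

```python
import math
from collections import Counter


def train_unigram(text):
    """Count words for one class; returns (counts, total number of words)."""
    counts = Counter(text.lower().split())
    return counts, sum(counts.values())


def log_score(words, counts, total, vocab_size, prior):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    score = math.log(prior)
    for word in words:
        score += math.log((counts[word] + 1) / (total + vocab_size))
    return score


def classify(text, models, priors):
    """Return the class label with the highest log score."""
    words = text.lower().split()
    vocab = set()
    for counts, _ in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def score(label):
        counts, total = models[label]
        return log_score(words, counts, total, vocab_size, priors[label])

    return max(models, key=score)
```

Because the scores are sums of logarithms, a text of several thousand words gives a large negative but finite number, where the plain product of probabilities would underflow to 0.0.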
Sure @dsaw you can get started, nobody is working on this yet.
@MrDupin I did some work related to language identification during my GSoC (the year before yours). This follows from an example in the book. If anyone is keen on taking this up - here is the notebook: https://github.com/reachtarunhere/aima-python/blob/lang-id/lang-id.ipynb
If not, I would be happy to clean it up and get it merged myself.
@reachtarunhere I only skimmed through it, but it looks good. Maybe the current GSoC students can take it up?
@MrDupin I would have loved to take it up but I have no knowledge of NLP. I'll definitely come back to this once I get acquainted with the topic.
@ad71 or anyone else taking this up - feel free to get in touch if you need any help/clarifications. In case I miss this here (lots of notifs) you can always drop an email :)
Hi @MrDupin, I have some experience in NLP and would like to work on this issue if it is still open. I would be working on case 2, and would like to use different methods like the bag-of-words model.
@llucifer97: You can definitely work on the issue! Simply fork the repository, make your changes to the notebook, and submit a Pull Request.
If you need any help, feel free to ask (although I am swamped with work at the moment so I am not sure I will be able to provide much help).
Hello @MrDupin.
Would I be right in assuming that this isn't an issue that someone completely new to opensource and AI-related topics can handle? I am well versed with Python but have never dabbled with this sort of thing.
Apologies if this is a very trivial question.
@aditya-hari: Hmmm, I'm not sure... I don't think any past experience with open-source is necessary, but I believe some understanding of the AI concepts covered in this notebook is a must. If you want you can take it on, but it seems to me that it will be difficult to make much progress efficiently, mainly because not only do you have to execute the concepts, but also showcase them in a manner in which others can understand them.
It is up to you, but I first suggest you read up on NLP and some basic AI material in order to be better equipped to tackle this challenge.
It is encouraging to know that past experience isn't necessary.
Could you kindly guide me to some resources which can help me get up to speed? Or point me to some other issues which would be easier to start off with?
Would really appreciate any help.
Hi @aditya-hari, sorry for taking so long to get back to you.
The way this repository works no longer involves issues (there are exceptions though). The main work remaining is to simply add to the notebooks.
For NLP, I am afraid there's not much I can say. Personally, I was just looking around the internet for information, I was never taught anything in a class. I suggest you snoop around and try to find a university course or something that is open to the public. Then you can follow its structure, complementing your studying with googling.
Hi @MrDupin, I have never contributed to open source before, so pardon me if I say anything trivial. Could you please tell me how I could contribute to your projects? I have some NLP experience and have implemented a research paper on text analysis in code, so could you tell me what I could contribute?
@vaibhavshukla182: Hi! Since most of the algorithms in this project have been implemented, I suggest you turn to providing examples for the notebooks. Maybe some text analysis, or a tutorial on NLP techniques, or stuff of this nature. Remember that this project is educational, so the more examples we have the better. Just don't forget to include some instructions on how your examples work.
Also, I believe you can implement algorithms outside the AIMA book as well. But only if they have short implementations and easy testing under pytest. If you have any questions, feel free to ask!
Can I add some data preprocessing and data visualisation techniques? Or something like nlp for fuzzy string matching, or a comparison of various methods for text classification and which one is better in which situation.
@vaibhavshukla182: This sounds great! For the nlp stuff though, don't use any external libraries. We want to keep all the code right in front of the student, with no middle-man. Feel free to use visualization libraries (preferably matplotlib) for the viz stuff.
@MrDupin I too want to work on the visualisation technique. I was thinking of implementing t-SNE for dimensionality reduction. I know this will not bring great results for a big dataset. Also, I want to add tf-idf and word2vec in addition to BOW and n-grams. I can also work with different algorithms on the same dataset, like KNN and KNN on time-split (temporal) data, and so on.
Also, I think I might have to add an additional dataset on which I can do some supervised classification and a polarity-based t-SNE visualisation, as well as KNN. Please see this for a dataset recommendation: https://github.com/aimacode/aima-data/issues/10#issuecomment-436382620
@cursed4ever I think for now TF.IDF would be a nice addition, if you want to get started with it.
@MrDupin I am working on a toy dataset containing amazon customer reviews and implementing BOW, N-grams and tf-idf and providing visualisation for the same.
@cursed4ever This sounds good. Just make sure we actually have the right to publish and use the dataset you are working with, using proper attribution.
I am currently using a dataset published on Kaggle, so I believe it is open source and we can use it. Correct me if I am wrong.
I am not familiar with Kaggle's licensing terms, so I can't say for sure. Usually they will share their license in the dataset page, but you may need to ask them if you have permission to redistribute their data.
I asked the organisation, and they said that their dataset is free to use for academic purposes. They only require us to cite their website in the publication.
OK, then it's fine if you use it here, thanks!
Hey, I hope this issue is still open. I'd like to add some pre-processing tasks + analysis of the text.
@Insiyaa sure, go ahead.
@MrDupin I'm unsure as to where @ashwinnalwade got to on point 3 but I'm happy to start this or continue from where he left off?
If not are there any other issues you'd suggest I get stuck into? I'm comfortable with python and have experience with AI (incl. NLP). Thanks.
@dave-light There hasn't been much progress on this issue. Feel free to work on whatever aspect you feel like and submit a pull request. @MrDupin can guide you further for NLP-specific queries.
@MrDupin sir, I want to add gensim and lemmatization to it.
@hackerashish25 Although I am not @MrDupin if you scroll up, you can find in one of his replies the following:
@vaibhavshukla182: This sounds great! For the nlp stuff though, don't use any external libraries. We want to keep all the code right in front of the student, with no middle-man. Feel free to use visualization libraries (preferably matplotlib) for the viz stuff.
@MrDupin For the preprocessing of text, we must remove numerics from the text since we are only dealing with textual data. So, should I go ahead and add this to the pre-processing of the data? Further, I couldn't find any implementation of TF-IDF as mentioned by others above in the conversation. Is anyone still working on it? If not, can I start working on it?
@sagar-sehgal: I am not sure why we must remove numerics from the text. Is there a reason for that? If so, it would be great to do that.
Also, adding the tf-idf algorithm would be a great idea as well, even though we don't explicitly have the pseudocode in the book.
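For reference, tf-idf needs nothing beyond the standard library. The sketch below is one common variant (term frequency normalised by document length, unsmoothed logarithmic inverse document frequency); other weightings exist, and the function name `tf_idf` is hypothetical rather than anything from the repository.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf weights for a list of tokenised documents.

    documents: list of lists of tokens.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero on empty documents
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, a term that occurs in every document gets weight 0, so very common words drop out without a separate stopword list.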
@MrDupin I have made one PR for text processing. Now I would like to add tf-idf for text processing, but can I use an external library to do so?
@hackerashish25 What external library are you planning on using? Quite a while ago I wrote an implementation for tf-idf, and it did not require any libraries, if I remember correctly.
Also, I believe you can implement algorithms outside the AIMA book as well. But only if they have short implementations and easy testing under pytest. If you have any questions, feel free to ask!
Hello @MrDupin, I want to contribute to aima-python. It seems most of the algorithms have already been implemented. Can I propose and work on some AI optimization algorithms? ( If adding new algorithms outside AIMA book is valid) I was suggesting to implement algorithms like Stochastic Gradient Descent, Particle Swarm Optimization, etc. I would also like to work on other notebook completions if the above algorithms implementation is not necessary for the project. Thank You!
@Ask149: I am not familiar with Particle Swarm, but SGD is something that can be implemented here, sure!
Hey @MrDupin, I want to contribute to aima-python. I am interested and experienced in the fields of NLP, genetic algorithms, particle swarm optimizations, reinforcement learning and information retrieval. Any pointers you could give me on what project should I take up? I want to do it as my GSoC project.
Hello @MrDupin , Can I take on "sentiment analysis" for the apps notebook if no one is working on it? Also I would like to implement customisable vectorisers as utility functions that can come handy in many situations.
Hello @MrDupin, I am interested in starting with sentiment analysis and would like to contribute to the NLP Apps notebook. Any pointers you could give me on project selection, as I want to do this for my GSoC project.
Hello all (@JayantSravan, @ShaswatLenka, @aasthasood). Thank you for your interest in this project.
For sentiment analysis, you can pick a project and work on it. A popular one is the movie reviews dataset. I would prefer if you picked something else (you can research "sentiment analysis datasets" on Google). Pick a relatively small dataset and work on it.
Remember though that you cannot use any external libraries for the training, you have to do it yourselves to showcase the algorithm.
Hi @MrDupin, can you repost what else is left to be done? I see that the log, visualization, and preprocessing work is already completed. Can I go with sentiment analysis?
@rushic24: Sure, you can solve some sentiment analysis problems. Sounds good!
Hi @MrDupin , has anyone taken up the sentiment analysis example? If not, I can contribute since I have prior work on IMDB dataset, which I can probably re-use. Also, I can take up the explanation of the 'Question Answering' section in nlp.ipynb? Thanks.
Recently in the nlp_apps notebook I added a section on the Federalist Papers. What I did was write a simple workflow from start to finish. There is a lot of work to be done still and I am opening this to community contributions, since I believe it is a great way to get started with the applications notebooks.
A few ways you can improve the section:
DONE - One big issue with the Naive Bayes Classifier in this problem is that the multiplication of probabilities causes underflow (all the probability multiplications result in 0.0). That happens because examples are long texts. To avoid this, we are currently using the decimal module of Python. I believe this problem can be solved more elegantly using the logarithm of probabilities instead of probabilities. So instead of multiplying the probabilities, we add their logarithms.
Do some pre-processing. Currently I only added a sample pre-processing step (removing one common word from each paper). I would like to see some other pre-processing tasks + analysis of the text. Which are the most common words for each author? Is it worth it if we removed the most popular words?
Right now we are using unigram word models. There are other options available too. I would like to explore this in the notebooks. Maybe an author likes using two words together. Maybe another spells some words a bit differently. I would like to see different models used/explored in the notebook, to let the readers know that they shouldn't rely on just one model all the time. We can even combine models together.
At the end of the notebook I note that the dataset is lopsided. We have way more information on Hamilton than the other two. Maybe it is worth adding some more writings from Jay/Madison to balance this out. I think it would be interesting to see if we could improve the results by using external data. This could come after the current section, so that we could compare the results.
Finally, maybe we can take a step back and try and classify all the Federalist papers, not just the disputed ones. Add a new section where we use external data to train our model and then try and classify the papers.
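On the point above about moving beyond unigram word models, a minimal sketch of counting word n-grams could look like this. The helper names are hypothetical, and whitespace splitting stands in for whatever tokenisation the notebook actually uses.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n=2):
    """Count word n-grams in a text (bigrams by default)."""
    return Counter(ngrams(text.lower().split(), n))
```

Comparing the most frequent bigrams per author is one way to check whether an author tends to use two particular words together.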
This is a big undertaking, and it doesn't need to happen on this particular problem. If you have a problem in mind, you can instead use the above ideas to tackle your own problems! Sentiment Analysis is trending right now, so maybe this is a place to explore some of the above.
All in all, I think this is a good project to chip in every once in a while and I hope it will serve as an introduction to the repository. Or maybe it will sound interesting to GSoC students who might choose to tackle this.
In any case, feel free to post here with ideas + if you want to start working on something.