JuliaText / TextAnalysis.jl

Julia package for text analysis
Other
373 stars 95 forks source link

Vocabulary extraction #280

Open atantos opened 5 months ago

atantos commented 5 months ago

Dear maintainers. (@rssdev10 )

One of the common challenges I faced with the StringDocument type was extracting a vocabulary in the form of an OrderedDict for use in the cooccurrence matrix. Previously, it was necessary to create custom functions for the OrderedDict. A ready-made vocabulary dictionary would be an extremely useful tool for various tasks involved in processing a corpus of texts or a single text.

I added the function vocab() with its docstring in the document.jl with two types of input (StringDocument and Vector{String}) that returns the OrderedDict of the vocabulary. I also added two tests in test.jl and updated the documentation in coom.jl for TextAnalysis.coo_matrix().

PS: Do not know why I get the UndefVarError: OrderedDict not defined error in the tests. I thought that the DataStructures were imported. Locally, it runs ok!

rssdev10 commented 5 months ago

Hi Alex, thanks for the PR.

atantos commented 5 months ago

Hi there!

Thanks for the comments!

  • Please don't mix style changes with functional changes. There are a lot of changes in your PR that are not really changes. Style fixes should be merged into separate commits with no other changes.

Sorry for teh style changes. They come from vscode editor automatically saving these kind of changes. I guess will have to undo these specific changes that it does (deleting tabs in exports, most of the times).

Sounds like a good idea. On the other hand, this is the place related to language modeling. Am not sure either. I wll give it a thought and will come back soon.

  • The vocab name is confusing. Maybe ordered_vocab would be better for your case.

Indeed! ordered_vocad sounds find. (a more general question: naming conventions do not follow Julia's guidelines, if I am right?)

  • The error in the unit test occurs because document.jl doesn't contain using OrderedDict. Use the package test locally. You can activate it with using Pkg; Pkg.test() or run from a package manager with the ]test command.

So, I guess this means that in order to have it pass the tests, I simply need to use the using statement in the beginning of the file just above @testset, right? It seems that was the case and it worked.

  • Whenever you add documentation, use julia-repl keyword for the REPL examples. Or use

Thanks! I updated the comments.

  • You can check documentation locally with include("docs/make.jl") from REPL.

Nice!

rssdev10 commented 5 months ago

Sorry for the style changes. They come from vscode editor automatically saving these kind of changes. I guess will have to undo these specific changes that it does (deleting tabs in exports, most of the times).

In the VS Code there is a tool for doing partial commits. https://www.youtube.com/watch?v=sYTwr1OSUlo

Regarding names conversation. Technically, there is https://docs.julialang.org/en/v1/manual/style-guide/ . But the strict requirement for function names is no longer present. In most scripting languages, the snake style of function names is considered more readable than the initial math style of names without any separators in Julia.

So, I guess this means that in order to have it pass the tests, I simply need to use the using statement in the beginning of the file just above @testset, right? It seems that was the case and it worked.

correct. Also, my personal preference is to always have independent test files and be able to run them separately. This makes it easy to develop and debug your own code when you are working on a few specific functions.

Sounds like a good idea. On the other hand, this is the place related to language modeling. Am not sure either. I will give it a thought and will come back soon.

It would be good to take a pause and review this again. We need to avoid spreading similar functionality around the codebase. But from another side, some changes could be made to the existing structure.

atantos commented 5 months ago

In the VS Code there is a tool for doing partial commits. https://www.youtube.com/watch?v=sYTwr1OSUlo

Thanks! I will check out the video and install it.

Regarding names conversation. Technically, there is https://docs.julialang.org/en/v1/manual/style-guide/ . But the strict requirement for function names is no longer present. In most scripting languages, the snake style of function names is considered more readable than the initial math style of names without any separators in Julia.

Snake style it is then. Thanks for giving me the background!

correct. Also, my personal preference is to always have independent test files and be able to run them separately. This makes it easy to develop and debug your own code when you are working on a few specific functions.

I think you are probably right. Having them separately is clearer.

It would be good to take a pause and review this again. We need to avoid spreading similar functionality around the codebase. But from another side, some changes could be made to the existing structure.

Yes, I agree with both points. I'm planning to conduct a survey comparing Python and R's analogous packages/libraries, and I'll return with suggestions. I believe Slack is a better platform for discussing package design.

However, I have two preliminary points:

1) My background is in corpus/language data analysis, and there's a large community in the humanities without prior programming experience who want to use programming tools for tasks like statistical tests and modeling. They prefer tools that simplify their work, avoiding the need to switch between multiple, distinct packages. So, their workflow needs user-friendliness. TextAnalysis.jl is excellent for those who understand some underlying mechanics, but most users need straightforward tools that perform exactly as needed. R, currently popular in the humanities, offers this simplicity with one-liners for specific tasks. Users don't delve into the underlying functions and still achieve their goals, even if it takes slightly longer. So, how do we encourage them to try new tools? In my opinion, providing easy-to-use tools for tasks they're accustomed to in R/Python, and then giving them the perspective of having so many other utilities and flexibility with types and their hierarchies, clear code with multiple dispatch etc. The first step is to offer them familiar one-liners from R/Python to ease the transition.

2) Building on point 1, regarding the ordered_vocab() function, having a user-friendly vocabulary function is crucial for various types of language data analysis, like measuring complexity and/or associations based on cooccurrence matrices in various fields. Until now, users of TextAnalysis.jl would have to create their own vocabularies, a daunting task for many in the humanities already overwhelmed by R's one-liners. Integrating ordered_vocab() (i suggest we decide for a temporaray solution wrt the location but still do the merge so that I can prepare a tutorial or two on using TextAnalysis.jl soon) and I would go one step further and suggest that even hidding the vocab argument from the user (creating the vocabulary by default) would simplify usage, eliminating the need for users to manually create and add it as an argument to functions such as TextAnalysis.coo_matrix().

I realize these suggestions involve significant changes and discussions. But, having worked in the humanities for years and conducted tutorials for various audiences (e.g. philologists, linguists, teachers, psychologists, historians, librarians), I believe my insights are valuable in attracting people from these fields.