-
```
Purpose of addition of this task:
acquire Inuktitut corpus which is large enough to test multilingual open source
text processing tools for Inuktitut apps
When reviewing task, please focus on:
s…
-
I forked to poke around and see how some of the natural language processing stuff was working. The API endpoints require a file `wordplay.js` that doesn't exist in the repo that seems to be doin…
-
On a large corpus, the status shows 'Uploading' irrespective of what Voyant Server is doing.
Perhaps a bar graph, or status bar of the upload, and then a status change to 'processing', or similar, would …
-
> - “Linguistic research is multifaceted and spans diverse areas such as corpus analysis, conversation/discourse analysis, experimental research, and more.”
--> This is a very reduced list of linguis…
-
This part involves
1. researching and extracting stop words, for example
```["document", "story", "machine translation", "translation", "figure"] -> ["machine translation"]```, performing NLP analy…
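The stop-word example above can be sketched in Python. This is only a minimal illustration, assuming that stop words are matched as whole terms and case-insensitively. The function name `remove_stop_words` and the `STOP_WORDS` set are hypothetical, chosen to reproduce the example, and are not taken from the project.

```python
def remove_stop_words(terms, stop_words):
    """Return the terms that are not stop words, preserving input order."""
    # Compare case-insensitively so "Story" and "story" are treated alike.
    stop = {word.lower() for word in stop_words}
    return [term for term in terms if term.lower() not in stop]


# Hypothetical stop-word set (an assumption, not the project's actual list).
STOP_WORDS = {"document", "story", "translation", "figure"}

terms = ["document", "story", "machine translation", "translation", "figure"]
filtered = remove_stop_words(terms, STOP_WORDS)
print(filtered)
```

Because whole terms are compared, the multi-word term "machine translation" survives even though "translation" on its own is a stop word.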
-
Hi
So I was training a new tokenizer from the Llama tokenizer (meta-llama/Llama-2-7b-hf) on a medium-sized corpus (Fineweb-10BT sample: 15 million documents with an average length of 2300 characters). A…
-
Hi, I am using this fantastic tool to generate tests that are more understandable than those of the original EvoSuite. However, I encountered an issue when using it. I also created this issue on [UTGen/UTGen…
-
I am investigating performance problems in `load.corpus`. I think that performance could be improved significantly by replacing `scan` with another approach to loading files.
This flame graph from …
-
A lot of the time when I talk to the model it replies with simply _UNK. Especially for quite short queries. When I train with my own larger corpus it does this a lot more than with the movie corpus al…
-
Hi
I’m conducting research regarding OCR corpora, and I would like to use this project to evaluate how differences in the training corpus affect the quality of the post-processing.
But, I ha…