-
How should the input for this tool normally be processed? We need it to be at least tokenized and lemmatized; identification of MWEs would also be of interest.
- lemmatization can be done with…
-
Hi team et al,
I'd like to know how to process BookCorpus for pre-training.
I am confused about how to process this data.
Should I treat 1 book as a document including all sentences or 1 chapter as a docu…
-
The code for ExtractUCIStats.scala seems to process the tab-delimited combined corpus and not the external WackyPedia corpus. Is there a newer version of ExtractUCIStats.scala that uses WackyPedia?
M…
-
The quantitative testing project at the CRS dev retreat in Nov 2024 (https://github.com/coreruleset/coreruleset/wiki/Discussion-Quantitative-Testing) revealed some false positives on 933160.
Here i…
-
### Add Link
https://pytorch.org/tutorials/beginner/chatbot_tutorial.html#chatbot-tutorial
### Describe the bug
I downloaded the zip and extracted it.
Now I got this error:
```
Processing …
-
This code, "lyndon-factors", is the first I know of that tries to manipulate alphabets to change the number of factors.
I know this is aimed at biological sequences, but my application is a text corpus and I …
-
Do not close this issue until all checkboxes below are complete or have been rescheduled:
List of corpora:
In [Processed OCR folder](https://github.com/CopticScriptorium/OCR/tree/main/Processed%…
-
From #8:
For consistency with the Softcite dataset, when software is referenced implicitly, we should include in the annotated name chunk only the "noun" ("script", "package", "program"), for ex…
-
Before replying to a message sent to team (at) tatoeba.org, I tried to update this page, https://en.wiki.tatoeba.org/articles/edit/using-the-tatoeba-corpus , with the following text.
Clicking, "save"…
-
Having a look at https://github.com/ZUGFeRD/corpus/tree/master/XML-Rechnung, I am trying to determine which license statements hold true for the individual files.
What I have seen so far:
* Ther…