corpus-processing Search Results

own-pt/sensetion.el #107

corpus processing

How should the input for this tool normally be processed? We need it to be at least tokenized and lemmatized; the identification of MWEs would also be of interest. - Lemmatization can be done with…

odanoburu updated 5 years ago

google-research/bert #906

Processing book corpus

Hi team et al., I'd like to know how to process bookcorpus for pre-training. I am confused about how to process this data. Should I treat 1 book as a document including all sentences, or 1 chapter as a docu…

ngoanpv updated 2 years ago

fozziethebeat/TopicModelComparison #1

The code for ExtractUCIStats.scala seems to process the tab delimited combined corpus and not the external WackyPedia corpus. Is there a newer version of ExtractUCIStats.scala that uses WackyPedia? M…

aneesha updated 11 years ago

coreruleset/coreruleset #3931

False positives with 933160 PL1 PHP Injection Attack: High-R…

The quantitative testing project at the CRS dev retreat in Nov 2024 (https://github.com/coreruleset/coreruleset/wiki/Discussion-Quantitative-Testing) revealed some false positives on 933160. Here i…

dune73 updated 3 weeks ago

pytorch/tutorials #2273

[BUG] - Chatbot Tutorial - Unterminated string starting at: …

### Add Link https://pytorch.org/tutorials/beginner/chatbot_tutorial.html#chatbot-tutorial ### Describe the bug I downloaded the zip and extracted it. Now I got this error: ``` Processing …

levalencia updated 2 weeks ago

amandaclare/lyndon-factors #4

Crashes processing text (information retrieval) corpus

This code, "lyndon-factors", is the first I know of that tries to manipulate alphabets to change the number of factors. I know this is aimed at biological sequences, but my application is a text corpus and I …

albertiniufu updated 4 years ago

CopticScriptorium/OCR #1

Publication thread summer/fall 2024 OCR documents

Do not close this issue until all checkboxes below are complete or have been rescheduled: List of corpora: In [Processed OCR folder](https://github.com/CopticScriptorium/OCR/tree/main/Processed%…

ctschroeder updated 3 weeks ago

SoFairOA/Dataset #11

Annotation Feedback - implicit mentions with nouns and chara…

From #8: For consistency with the Softcite dataset, when a software is referenced implicitly, we should include in the annotated name chunk only the "noun" ("script", "package", "program"), for ex…

lfoppiano updated 2 days ago

Tatoeba/tatoeba2 #3057

Wiki Page Not Editable

Before replying to a message sent to team (at) tatoeba.org, I tried to update this page, https://en.wiki.tatoeba.org/articles/edit/using-the-tatoeba-corpus , with the following text. Clicking, "save"…

ckjpn updated 5 days ago

ZUGFeRD/corpus #8

Unclear license statements

Having a look at https://github.com/ZUGFeRD/corpus/tree/master/XML-Rechnung, I am trying to determine which license statements hold true for the individual files. What I have seen so far: * Ther…

stefan6419846 updated 5 days ago

1000+ results for corpus-processing