-
How should the input for this tool normally be processed? We need it to be at least tokenized and lemmatized; identifying MWEs would also be of interest.
- lemmatization can be done with…
-
Hi team et al,
I'd like to know how to process BookCorpus for pre-training.
I am confused about how to process this data.
Should I treat 1 book as a document including all sentences or 1 chapter as a docu…
-
The code for ExtractUCIStats.scala seems to process the tab-delimited combined corpus and not the external WackyPedia corpus. Is there a newer version of ExtractUCIStats.scala that uses WackyPedia?
M…
-
Hi,
Great work!
I'm currently working on a project where I need to generate a custom description dataset similar to the one used in HumanML3D. I noticed that you've made changes to your own combat…
-
https://research.sign.mt/#list-of-datasets has a number of TODO items for:
- [x] ASLVD https://github.com/sign-language-processing/sign-language-processing.github.io/pull/44
- [ ] ATIS
- [ ] [AU…
-
I think it would be good to include the current size of the corpus (as of date _x_) in the README.
When I started processing, I wasn't sure how much hard drive space I would need.
FWIW, I downlo…
-
This code, "lyndon-factors", is the first I know of that tries to manipulate alphabets to change the number of factors.
I know this is aimed at biological sequences, but my application is a text corpus and I …
-
I have this zstd JS library:
https://github.com/101arrowz/fzstd
to render decompressed zstd in the browser.
I was wondering if there is an "officially" supported English language / dictionary corpus (good enough f…
-
Currently, [`current_corpus_idx`](https://github.com/AFLplusplus/LibAFL/blob/0777873aaef62e309075cb4b64ef04e0b0124afe/libafl/src/corpus/mod.rs#L186) returns an `Option`. However, most places where `cu…
-
To fuzz font processing, such as loading glyph outlines, we would like to have two inputs:
1. The usual `data: &[u8]`, mutated from a corpus entry
* data is thus relatively likely to be a somew…