-
Here we combine all the datasets we could collect:
- [OSCAR's CommonCrawl Dataset](https://traces1.inria.fr/oscar/)
- [Arabic BERT Corpus](https://www.kaggle.com/abedkhooli/arabic-bert-corpus)
- [Hi…
-
Hi,
I tried to run the experiments in `run_rmu_zephyr.ipynb`, but for the evaluation I could not use the same batch size as the original code because of limited GPU memory. I was running th…
-
@PolMine For some reason this version fails for me:
```
---> Building R-RcppCWB
xinstall: mkdir /opt/local/var/macports/build/_opt_PPCSnowLeopardPorts_R_R-RcppCWB/R-RcppCWB/work/build
Executing: …
-
The link shared in the footnote (http://www.statmt.org/wmt20/quality-estimation-task.html) for downloading the "publicly available
bilingual corpora that were used to train the target machine translation…
-
`bulk_extract_corpora` and `extract_corpora` do not remove all lemmas and Strong's numbers from translations such as `hbo_uhb` and others from Door43.
-
To do:
- [x] Get relevant n-grams of the corpora.
- [ ] Compare different n-grams for co-occurrence in both English and US corpora.
- [ ] Check out surprisal tool - used to be in NLTK. Find out why …
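
For the n-gram comparison item, a minimal sketch could look like the following. It assumes lowercased, whitespace-tokenized text and uses only the standard library; the function names and the two toy corpora are hypothetical stand-ins, not the project's actual code or data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n):
    """Count n-grams over sentences (lowercased, whitespace tokenization)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def shared_ngrams(counts_a, counts_b):
    """Map each n-gram found in both corpora to its (count_a, count_b) pair."""
    return {g: (counts_a[g], counts_b[g]) for g in counts_a.keys() & counts_b.keys()}


# Hypothetical toy corpora standing in for the two real ones.
corpus_a = ["the colour of the sky", "the colour of the sea"]
corpus_b = ["the color of the sky", "the sky is blue"]

bigrams_a = ngram_counts(corpus_a, 2)
bigrams_b = ngram_counts(corpus_b, 2)
common = shared_ngrams(bigrams_a, bigrams_b)

for gram, (a, b) in sorted(common.items()):
    print(" ".join(gram), a, b)
```

Raw counts are only comparable when the corpora are of similar size; for real data, relative frequencies would probably be the better basis for comparison.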
-
Should we add a separate flag for "only pretranslate"? Or should it work automatically, so that when there are no matching corpora we simply don't include the key terms?
-
http://opus.lingfil.uu.se/
http://www.statmt.org/europarl/
How can we use them?
-
http://hdl.handle.net/11372/LRT-865
- [ ] Unclear annotation
- [ ] Missing licence