-
Dear developers,
I did not find an option to limit the vocabulary. For example, I don't want to learn representations for words that occur fewer than 50 times in my corpus.
The reason is that if I use all…
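The frequency cutoff the request describes can also be done as a pre-filtering step outside the library. The sketch below is illustrative only: it assumes a tokenized corpus (a list of token lists) rather than any particular library's API (in gensim's `Word2Vec`, for example, the equivalent knob is `min_count`).

```python
from collections import Counter

def build_vocab(corpus, min_count=50):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for sentence in corpus for tok in sentence)
    return {tok for tok, c in counts.items() if c >= min_count}

# Tiny illustration with min_count=2: "banana" occurs once and is dropped.
corpus = [["apple", "banana"], ["apple"]]
print(build_vocab(corpus, min_count=2))  # {'apple'}
```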
-
The current list of Resource Types https://github.com/Sinar/sinar.resource/blob/main/src/sinar/resource/vocabularies/resource_type.py is very general, missing some commonly used types and often not sp…
-
## 🐛 Bug
The special large-batch / large-alphabet handling, although it can sometimes provide up to a 2.5x speedup, comes at the cost of up to 1.8x more memory. For large targets, this can be a significant…
ASDen updated 5 years ago
-
There are huge amounts of data being generated at hospitals every day. Up to 80% of this data is collected in an unstructured format and a large portion of it as free text. In order to extract value f…
-
This is partly a bug and partly a feature; it was discovered when I ran the tool on a subset of Gates publications from DCP, in particular only 4 publications.
The way the fuzzy matcher wor…
-
**System information**
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): mac m1
- TensorFlow version and how it was installed (source or binary): binary
- TensorFlow-Recommenders-Addons ve…
-
When applying pretrained models on real datasets, we often need to adapt the tokenizer and ensure that we can appropriately transfer the knowledge:
- Case 1: Trim the vocabulary
For example, …
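One way to sketch "Case 1: Trim the vocabulary" above: keep only the pretrained tokens that actually appear in the target corpus, re-index them, and slice the embedding matrix to match. The names `trim_vocab`, `vocab`, and `embeddings` are hypothetical, not taken from any specific tokenizer API.

```python
import numpy as np

def trim_vocab(vocab, embeddings, corpus_tokens):
    """Drop vocabulary entries unseen in the target corpus and slice
    the pretrained embedding matrix to the new, smaller vocabulary."""
    kept = [tok for tok in vocab if tok in corpus_tokens]
    new_vocab = {tok: i for i, tok in enumerate(kept)}
    new_embeddings = embeddings[[vocab[tok] for tok in kept]]
    return new_vocab, new_embeddings

vocab = {"cat": 0, "dog": 1, "axolotl": 2}
embeddings = np.arange(6, dtype=float).reshape(3, 2)  # one 2-d row per token
new_vocab, new_emb = trim_vocab(vocab, embeddings, {"cat", "axolotl"})
print(new_vocab)  # {'cat': 0, 'axolotl': 1}
```

In practice one usually also keeps special tokens ([UNK], [PAD], etc.) regardless of corpus frequency.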
-
The current text, while constructive and helpful, is slightly too long on the various pages, especially when first entering Emnesøk. Also, not all concepts may be clear to a user unfamiliar with subject …
-
I am working on using NanoGPT to solve a geometry problem. I would like to use the gpt2 network structure but my own tokenizer. My vocabulary size is 1500. I have my own encode/decode code to convert …
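A minimal sketch of such a custom tokenizer, assuming a fixed inventory of 1500 symbols (the symbol names below are placeholders for the actual geometry vocabulary); nanoGPT's `GPTConfig` then needs its `vocab_size` field set to match.

```python
# Hypothetical symbol inventory; replace with the real geometry vocabulary.
symbols = [f"sym{i}" for i in range(1500)]
stoi = {s: i for i, s in enumerate(symbols)}
itos = {i: s for s, i in stoi.items()}

def encode(tokens):
    """Map a list of symbols to integer ids for the model."""
    return [stoi[t] for t in tokens]

def decode(ids):
    """Map integer ids back to symbols."""
    return [itos[i] for i in ids]

print(encode(["sym3", "sym7"]))  # [3, 7]
# The model must be built with a matching vocabulary size, e.g.
# GPTConfig(vocab_size=1500, ...) instead of GPT-2's 50257.
```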
-
What: WordPiece is an unsupervised multilingual text tokenizer. It is used in models such as BERT, though it can in principle be used in many NLP models. It produces a user-specified fixed-size vocabul…
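At inference time, WordPiece typically tokenizes each word with a greedy longest-match-first scan, marking word-internal pieces with a `##` prefix. The sketch below shows that scan on a toy vocabulary; real implementations add details such as a maximum characters-per-word limit.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal pieces
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate piece from the right
        if cur is None:
            return [unk]  # no prefix of the remainder is in the vocabulary
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```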