-
In several texts (e.g. Urfa-107_Cotton_Business) one ELAN segment contains utterances of several speakers. It would be good to separate those:
* manually split ELAN segments
* replace speaker init…
-
@kaushaltrivedi
**While training on sagemaker i'm facing this issue**
- INFO - root - Writing example 0 of 6382
Exception during training: expected string or bytes-like object
Traceback (m…
-
Right now the tokenizer decode method supports only a single instance at a time. I think it would be good to have `batch_decode` function and also support `skip_special_tokens` and `clean_up_tokenizat…
-
RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 python cogagent_model_worker.py --host 0.0.0.0 --port 40000 --from_pretrained "saved_models/ScreenAgent-2312" --bf16 --max_length 2048
[2024-04-10 13:38:43,071] [INF…
-
1. need to format debates into a better format than one block of text
2. need model / LLM to learn debate vocabulary and fix transcripts
3. delete repeats
4. make LLM "flow" -- store each argument in …
-
To support tokenization in NewDot which is adding a virtual card to Apple / Google Pay, we will need access to some native methods.
Problem:
Apple Pay and Google Pay are table stakes in the card …
-
http://ssptool.securitycentral.io/certifications/FedRAMP-low/NIST-800-53/IA-5
"h. Protecting authenticator content from unauthorized disclosure and modification;"
If usernames stored in database…
-
**pos:**
depends...
**Tokenization:**
_always joint?_, i.e. keep punctuation within the token.
**Dependency:**
depends... currently used in rev. freq. order:
```
28 nmod
9 name
…
-
In `test_fstring` the test `test_comments` fails. We need to check if is due to regular tokenization problems and the test needs updating or we need to fix something regarding comments in f-strings
-
- [ ] Add a de-tokenizer API which returns the original string representation given tokens
- [x] Replace existing tokenization code in `script/bert` with the new tokenizer API