-
As the title says, the `context_len` parameter also controls the token length in the `stream_generate_answer` function:
```python
@torch.inference_mode()
def stream_generate_answer(
    self,
    max_new_tokens=512,
    temperatur…
```
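For context, here is a minimal sketch (not this repo's actual code; the helper name and the 8-token slack are assumptions) of how a single `context_len` budget commonly ends up capping both the prompt window and the answer length:

```python
def truncate_for_generation(input_ids, context_len=2048, max_new_tokens=512):
    """Hypothetical helper: reserve room for the answer inside context_len."""
    # One budget, two consumers: whatever max_new_tokens takes is
    # subtracted from the prompt window, so the two lengths are coupled.
    max_src_len = context_len - max_new_tokens - 8  # 8: slack for special tokens (assumption)
    return input_ids[-max_src_len:]

print(len(truncate_for_generation(list(range(3000)))))  # 1528
```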
-
Thanks for your work!!
As shown in example.py, the caption is in tensor format. So, do I need to create my own transformer-like model to convert a text caption into tensor format?
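You typically don't need a whole transformer for this step; a tokenizer maps text to an ID tensor. A minimal sketch assuming a Hugging Face tokenizer (`bert-base-uncased` is only an illustration; the project may expect its own vocabulary):

```python
from transformers import AutoTokenizer

# Any pretrained tokenizer works for the illustration; the project may
# require the specific vocabulary its captioning model was trained with.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = "a dog running on the beach"
tensor = tokenizer(caption, return_tensors="pt").input_ids  # shape: (1, seq_len)
print(tensor)
```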
-
Hi @benoitgaudou! :wave:
Since https://github.com/gama-platform/gama/commit/7f328e8d1183cfa9098187dd9a1bc1090ee1b011, @AlexisDrogoul removed the `tokenize` function from StringUtils.
Therefore…
-
I write a lot with apostrophes and I can't figure out how to tokenize words like "don't".
I tried the following:
- `doesn\'t` appears as `doesn\`
- `doesn\t` appears as `doesn\t`
- `"doesn't"` appears as …
-
`word_tokenize` keeps opening single quotes and doesn't pad them with a space; this is to make sure that clitics get tokenized as `'ll`, `'ve`, etc.
The original Treebank tokenizer has the sam…
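A quick demo of the clitic behavior (requires `nltk` with the Punkt models downloaded via `nltk.download('punkt')`):

```python
from nltk.tokenize import word_tokenize

print(word_tokenize("I don't think they'll mind."))
# ['I', 'do', "n't", 'think', 'they', "'ll", 'mind', '.']
```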
-
The step that tokenizes the audio (into `input_ids`) seems to be missing from both finetune.py and finetune_low_resource.py in the LTU repo. Where is the detailed code for audio tokenization? I saw the 'load_…
-
This FTP diffing problem made me realize we should probably be splitting tokens in the HTML diff on periods (and maybe other punctuation?), not just on whitespace:
(screenshot: screen shot 2018-11-21 at 9 04 …)
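One possible approach is to split on a punctuation class while capturing the separators, so the diff can re-join tokens losslessly; a sketch (the exact character class is an assumption to tune):

```python
import re

def split_tokens(text):
    # The capturing group keeps separators as their own tokens.
    return [t for t in re.split(r"(\s+|[.,;:!?])", text) if t]

print(split_tokens("ftp.example.com/pub, updated 2018-11-21"))
# ['ftp', '.', 'example', '.', 'com/pub', ',', ' ', 'updated', ' ', '2018-11-21']
```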
-
### Overview
Sometimes a [ReadOnly]Span needs to be tokenized using more than one separator.
### API breakdown
```csharp
namespace CommunityToolkit.HighPerformance;
public static class SpanExte…
-
It would be good to have `__dask_tokenize__` methods added to the `Array` and possibly the `Group` classes. These methods are used to generate unique identifiers for different Dask objects. By default, they will…
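For reference, a minimal sketch of the protocol (the `Array` class below is a stand-in, not the library's actual implementation):

```python
from dask.base import tokenize

class Array:
    def __init__(self, store_path, shape, dtype):
        self.store_path = store_path
        self.shape = shape
        self.dtype = dtype

    def __dask_tokenize__(self):
        # Return a cheap, deterministic description; Dask hashes this
        # instead of inspecting (or reading) the underlying data.
        return (type(self).__name__, self.store_path, self.shape, self.dtype)

a = Array("data.zarr/x", (1000, 1000), "f8")
b = Array("data.zarr/x", (1000, 1000), "f8")
assert tokenize(a) == tokenize(b)  # same identity -> same token
```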
-
**Describe the bug**
Supplying a relative path to the data downloader lays a trap for `tokenize_and_cache.py`.
**To Reproduce**
Call `jiant/scripts/download_data/runscript.py` to download some t…
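A generic way to defuse this kind of trap (a sketch, not jiant's code) is to resolve user-supplied paths to absolute form before they are written anywhere:

```python
import os

data_dir = os.path.abspath("tasks/data")  # relative input from the user
# Now safe to embed in configs later read by tokenize_and_cache.py,
# regardless of which directory that script is launched from.
print(data_dir)
```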