-
@raver119
I'm using `ParagraphVectors` in an unsupervised (unlabeled) way. Rather than having a neural net with, say, ~50k words / NLP features, I'm constructing ParagraphVectors of 200-1000d in len…
-
## Problem
The vocabulary design system is not implemented across the site, which makes the site inconsistent with the Creative Commons design system. This will help create visual uniformity and acces…
-
Consider using the Stanford Tokenizer for GloVe. In their paper they say "We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words" and …
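
The vocabulary step quoted above (lowercase, tokenize, keep the most frequent words) could be sketched roughly as follows. This is only an illustration: `build_vocab` is a hypothetical helper, and plain whitespace splitting stands in for the Stanford tokenizer, which splits text differently.

```python
from collections import Counter


def build_vocab(lines, max_size=400_000):
    """Return the max_size most frequent lowercased tokens.

    Assumption: str.split() is a stand-in for the Stanford tokenizer.
    """
    counts = Counter()
    for line in lines:
        # Lowercase first, then tokenize on whitespace.
        counts.update(line.lower().split())
    # most_common sorts by descending frequency.
    return [word for word, _ in counts.most_common(max_size)]


corpus = ["The cat sat on the mat", "the dog sat"]
vocab = build_vocab(corpus, max_size=3)
```

Swapping in a different tokenizer only changes the line that produces tokens; the counting and truncation stay the same.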
-
## What problem are you trying to solve?
Many style guides, such as Google's, impose maximum column limits (https://google.github.io/styleguide/javaguide.html#s4.4-column-limit). There is tooling such…
-
### Feature request
Support decoding user-defined added tokens that get added to the end of the tokenizer's vocabulary for Whisper-based models. This requires modifying the if statement in [_decode_asr](…
-
Hi, thanks for this tool.
I have a very large language model and want to filter it according to a target vocabulary. Is there a specific format for the vocabulary?
If I have a test set, how to mat…
-
Recent works like YOLO-World and GroundingDINO mainly use Object365 and GoldG for pretraining. These methods do not use CLIP image encoder as the backbone (unlike some open-vocabulary detection method…
-
### What behavior of the library made you think about the improvement?
My understanding of the index construction process is that for each state in the FSM, we need to iterate through all tokens in…
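
A minimal sketch of the loop described above, assuming the truncated sentence refers to iterating over every token in the tokenizer vocabulary for each FSM state. The names `build_index`, `transitions`, and `vocabulary` are hypothetical and do not come from the library.

```python
def build_index(transitions, states, vocabulary):
    """Map each FSM state to the tokens that can be consumed from it.

    transitions: dict mapping (state, character) -> next state
    vocabulary:  dict mapping token id -> token string
    Returns {state: {token_id: state reached after the token}}.
    """
    index = {}
    for state in states:
        allowed = {}
        # The costly part: every state scans the whole vocabulary.
        for token_id, token in vocabulary.items():
            current = state
            for ch in token:
                current = transitions.get((current, ch))
                if current is None:
                    break  # token leaves the FSM, so it is not allowed
            else:
                allowed[token_id] = current
        index[state] = allowed
    return index


# Toy FSM accepting the string "ab": 0 -a-> 1 -b-> 2
transitions = {(0, "a"): 1, (1, "b"): 2}
vocabulary = {0: "a", 1: "b", 2: "ab", 3: "c"}
index = build_index(transitions, [0, 1, 2], vocabulary)
```

The cost of this naive version grows with the number of states times the vocabulary size, which is the behavior the question is about.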
-
Issues downloading models: `all` and `base` fail, but `large-v3` works (using a conda env). Error:
(atrain_core_env) C:\>aTrain_core load --model all
C:\anaconda3\envs\atrain_core_env\lib\site-packages\aT…
-
At a [Hackathon session](https://cfconventions.org/Meetings/2024-Workshop.html) of the recent CF Workshop 2024, I began updating the code I created in 2020 to produce some visualisations of the standa…