Twenkid / Vsy-Jack-Of-All-Trades-AGI-Bulgarian-Internet-Archive-And-Search-Engine

Artificial General Intelligence Infrastructure of "The Sacred Computer" AGI Institute: Custom Intelligent Selective Internet Archiving and Exploration/Crawling; Information Retrieval, Media Monitoring, Search Engine, Smart DB, Data Preservation, Knowledge Extraction, Dataset Creation, AI Generative Model Building and Testing, Experiments, etc.
MIT License

Literature, References, Resources, Papers, Links, Links to Libraries etc. #15

Open Twenkid opened 1 year ago

Twenkid commented 1 year ago

Note, 4.1.2023: During this research effort I have browsed, reviewed, revisited, and studied a huge number of articles and concepts, many of them reached by association while browsing, as a source of ideas. Ideally they would be stored in a dedicated representation: a database, a semantic network, etc.

For now I am starting with one item out of many hundreds, perhaps a thousand so far, and letting general curiosity grow from that seed. This is a research and development project in its own right: an automatic analysis and learning assistant, a reading assistant and accelerator, a cognitive accelerator, etc. An unpublished in-house project and experimental application, called [Research] Assistant, or ACS for short (Assistant C#), serves as a playground and inspiration for ideas and developments in these directions of "Cognitive Acceleration". In a broader sense, though, any computer and any software is such a tool.

Various statistical similarity methods: https://en.wikipedia.org/wiki/Semantic_similarity
A blog on Question Answering etc.: https://queryunderstanding.com/
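
As a concrete starting point for the statistical similarity methods, here is a minimal sketch of cosine similarity over bag-of-words term counts. It is one of the simplest baselines and measures lexical overlap rather than meaning; the function names are my own for illustration and are not taken from the linked pages.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the term-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Texts with no words in common score 0.0 and identical texts score (up to rounding) 1.0; embedding-based methods would be needed to catch synonyms or paraphrases.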

Twenkid commented 4 days ago

Speech recognition datasets etc.:
https://ai.meta.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/
https://arxiv.org/abs/2006.13979
https://ai.meta.com/blog/xls-r-self-supervised-speech-processing-for-128-languages/

Language identification library (tested; use the small model):

https://fasttext.cc/docs/en/language-identification.html https://huggingface.co/facebook/fasttext-language-identification
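
A rough sketch of how the fastText language identifier can be called from Python. I am assuming that "the small model" means the compressed `lid.176.ftz` file from the fastText page; the helper names `label_to_lang` and `detect_language` are hypothetical, introduced only for this example.

```python
def label_to_lang(label: str) -> str:
    """Strip fastText's '__label__' prefix, e.g. '__label__bg' -> 'bg'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def detect_language(model, text: str) -> tuple[str, float]:
    """Return (language code, confidence) for the most likely language.

    `model` is expected to be a loaded fastText model, i.e. any object whose
    predict(text, k) returns a pair (labels, probabilities).
    """
    # fastText's predict() processes one line at a time, so remove newlines.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return label_to_lang(labels[0]), float(probs[0])


# Usage (requires `pip install fasttext` and a downloaded model file;
# the file name below is an assumption about which model is meant):
#
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   print(detect_language(model, "Това е изречение на български."))
```

Loading the model once and passing it in keeps repeated calls cheap, which matters when filtering crawled pages in bulk.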

Common Crawl tools

https://github.com/facebookresearch/cc_net

Huge Dataset
https://github.com/togethercomputer/RedPajama-Data ... https://arxiv.org/abs/2007.10310

Twenkid commented 4 days ago

Bulgarian POS tagger and NER tagger (applied): https://github.com/AMontgomerie/bulgarian-nlp

https://github.com/AMontgomerie/bulgarian-nlp/blob/master/examples/pos_example.ipynb https://github.com/AMontgomerie/bulgarian-nlp/blob/master/examples/text_annotator_example.ipynb

About the Named-entity tags: https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
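
To make the tagging scheme concrete, here is a small sketch that groups IOB-tagged tokens into entity spans. The tag names (`B-PER`, `I-PER`, `B-LOC`, `O`) are assumed for illustration; check the tagger's own output for the exact label set it uses.

```python
def iob_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, entity type) pairs."""
    spans: list[tuple[str, str]] = []
    current: list[str] = []
    current_type = None

    def flush() -> None:
        # Close the entity being built, if there is one.
        nonlocal current, current_type
        if current:
            spans.append((" ".join(current), current_type))
        current, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != current_type):
            # "B-" always starts a new entity; a stray "I-" of a different
            # type is treated leniently as the start of a new entity too.
            flush()
            current, current_type = [token], ent_type
        elif prefix == "I":
            current.append(token)  # continue the current entity
        else:
            flush()  # "O": outside any entity
    flush()
    return spans


tokens = ["Иван", "Вазов", "е", "роден", "в", "Сопот"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(iob_to_spans(tokens, tags))
```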

Twenkid commented 4 days ago

PHATGOOSE Repository

PHATGOOSE, which stands for Post-Hoc Adaptive Gating Over an Ocean of Specialized Experts, enables zero-shot generalization from specialized experts (e.g. PEFT modules) trained on diverse datasets by adaptively routing among them. It requires an additional, inexpensive training step: a gate is trained in front of each frozen PEFT module for its corresponding task.

https://github.com/r-three/phatgoose
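
To illustrate the general idea of routing among experts by their gates, here is a heavily simplified sketch: each expert has one gate vector, and an activation is sent to the top-k experts whose gate vectors are most similar to it. This is not the repository's implementation; the function name, the use of cosine similarity with a plain softmax, and the toy vectors are all my assumptions.

```python
import math


def route_top_k(activation: list[float],
                gate_vectors: dict[str, list[float]],
                k: int = 2) -> dict[str, float]:
    """Pick the k experts whose gates best match the activation.

    Returns a mapping expert name -> mixing weight (weights sum to 1).
    """
    def cosine(u: list[float], v: list[float]) -> float:
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    # Score every expert by how well its gate vector matches the activation.
    scores = {name: cosine(activation, gate) for name, gate in gate_vectors.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    # Softmax over the selected experts only.
    z = sum(math.exp(scores[name]) for name in top)
    return {name: math.exp(scores[name]) / z for name in top}


# Toy gate vectors for three hypothetical experts.
gates = {"qa": [1.0, 0.0], "summarize": [0.0, 1.0], "translate": [-1.0, 0.0]}
print(route_top_k([1.0, 0.1], gates, k=2))
```

In a real model the selected experts' outputs would then be combined using these weights; that step is omitted here.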

Twenkid commented 4 days ago

Pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions

https://github.com/stanfordnlp/pyvene https://arxiv.org/abs/2403.07809