adrianco / meGPT


plurality of formats #14

Open BlipBertMon opened 2 months ago

BlipBertMon commented 2 months ago

Hi Adrian. Thanks, your 192-foil Best Hits was a respite from reality for an hour or so.

Pedantic issues & impressions: Is this big enough to count as Big Data? I thought training an LLM requires many gigabytes of fodder. The many different formats might also confound, unless there's an LLM with polyglot format-conversion capabilities on par with or better than SearchIt of yore. Beyond just extracting text from PPTs and PDFs, the visually aphoristic style of presos and the missing duality of dialogue seem likely to confound 'sensible' machine interpretation.

Endless non-auto-wrapped lines in the .txt files of Medium posts make HiTL browsing rather too challenging.

I downloaded the PDFs and .pptx files to view them, since GitHub apparently doesn't grok them. I'm guessing one could view the .txt files in a browser or auto-wrap them in TextEdit.
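For what it's worth, re-wrapping the long lines locally is a few lines of Python. This is only a sketch using the standard `textwrap` module; `wrap_text` is a made-up helper name, and it assumes each paragraph in the .txt file sits on one long line.

```python
import textwrap


def wrap_text(text: str, width: int = 80) -> str:
    """Re-wrap every line of `text` to at most `width` columns.

    Assumption: each paragraph is a single long line, so wrapping
    line by line keeps paragraph boundaries intact.
    """
    wrapped = []
    for line in text.splitlines():
        # textwrap.fill returns "" for an empty line, so blank
        # separator lines between paragraphs are preserved.
        wrapped.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped)
```

Reading a file, passing its contents through `wrap_text`, and writing the result to a new file would give a copy that is easier to browse by eye without touching the original.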

That's all as far as 'issues' go. Hope you are well.

This adventure also motivated me to sign up for Mastodon. Lots of cats.

Interestingly, I harbor a similar objective: training -something- on thousands of handwritten 4x6 notecards. The less-than-chatbot objective: simple OCR of my own handwriting and full-text search/retrieval; maybe a tag-cloud Zettelkasten.

Hundreds of topics I'd like to bounce off you :^) Guess those will have to wait.

adrianco commented 2 months ago

Thanks for the input. The way RAG works is that it takes lots of diverse kinds of input, but not many gigs of it, and indexes it for retrieval. I figure if I make a nice, clean, well-documented data set, the LLMs will figure out how to use it eventually.
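To make the "index it for retrieval" idea concrete, here is a toy sketch, not what meGPT actually does. It swaps the embedding model and vector store of a real RAG pipeline for plain word counts and cosine similarity, so it runs on the standard library alone. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each document name to a bag-of-words Counter.

    Assumption: `docs` is a dict of name -> extracted plain text,
    whatever the original format (PDF, PPTX, blog post) was.
    """
    return {name: Counter(tokenize(text)) for name, text in docs.items()}


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(index, query, k=2):
    """Return the names of the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]


def build_prompt(docs, index, query, k=2):
    """Assemble retrieved text plus the question into a prompt for an LLM."""
    context = "\n\n".join(docs[name] for name in retrieve(index, query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The point of the sketch is that nothing is trained: the documents are indexed once, and only the few passages relevant to a question are handed to the model at query time, which is why a modest but clean data set can be enough.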