Open alexander-singh opened 4 months ago
FYI I got around this for now by using the older code in this commit: https://github.com/Sstobo/Site-Sn33k/commit/5b6121e55d75a79fb7e6c0eb1eabdeecf3deb8d5
Thanks for the feedback! I'll get on it as soon as possible.
The updated embed code in vectorizor.py assumes a train.jsonl structure different from the one the chunker.py process creates. It appears the code was updated to match the new pdf-muncher file, but the two structures are not consistent:
chunker.py creates items with a {id:"id",text:"text",source:"source"} structure
pdf-muncher.py creates items with this structure:
vectorizor.py expects the latter format and returns an error when no PDFs are parsed.
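For anyone hitting the same error, here is a rough sketch of how one might check which records in train.jsonl follow the chunker.py shape (`id`, `text`, `source`) before running the embed step. The helper name `split_by_schema` is hypothetical and not part of the repo, and the check makes no assumption about what the pdf-muncher.py records look like; anything without the three chunker keys simply lands in the second list.

```python
import json

# Keys that chunker.py writes, per the issue description.
CHUNKER_KEYS = {"id", "text", "source"}


def split_by_schema(lines):
    """Split JSONL lines into chunker-style records and everything else.

    `lines` is any iterable of strings, e.g. an open train.jsonl file.
    Hypothetical helper for diagnosis only; not part of the repo.
    """
    chunker_style, other = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # A record counts as chunker-style if it is an object that has
        # at least the three chunker keys.
        if isinstance(record, dict) and CHUNKER_KEYS.issubset(record):
            chunker_style.append(record)
        else:
            other.append(record)
    return chunker_style, other


if __name__ == "__main__":
    # Made-up sample lines, used only to show the call.
    sample = [
        '{"id": "1", "text": "hello", "source": "https://example.com"}',
        '{"unexpected": "shape"}',
    ]
    chunker_style, other = split_by_schema(sample)
    print(len(chunker_style), "chunker-style,", len(other), "other")
```

To run it against the real file, pass the open file handle instead of the sample list, e.g. `with open("train.jsonl") as f: split_by_schema(f)`. If every record ends up in the first list, no PDF-derived items were produced, which matches the failing case described above.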