Sstobo / Site-Sn33k

A collection of tools to rip webpages, and clean them for pinecone

Updated Code for PDF Parsing Affects HTML Parsing #4

Open alexander-singh opened 4 months ago

alexander-singh commented 4 months ago

The updated vectorizor.py embed code assumes a different train.jsonl structure than what is created in the chunker.py process. It appears the code was updated based on the new pdf-muncher file, but the structure is not consistent:

chunker.py creates items with a {id:"id",text:"text",source:"source"} structure

{
    'id': f'{uid}-{i}',
    'text': chunk,
    'source': file_path
}

pdf-muncher.py creates items with this structure:

{
    'id': f'{uid}-{i}',
    'pageContent': chunk,  # Use the key 'pageContent' instead of 'text'
    'metadata': {
        'txtPath': file_path
    }
}

vectorizor.py expects the latter format and returns an error when no PDFs are parsed.
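
One possible stopgap, as a minimal sketch only: convert chunker.py-style items to the pdf-muncher.py layout before embedding. The helper name `normalize_record` is hypothetical and not part of the repo, and I am assuming that `source` maps onto `metadata.txtPath`, based purely on the two structures shown above.

```python
def normalize_record(record):
    """Return the item in the pdf-muncher.py layout (hypothetical helper).

    Items that already carry 'pageContent' are returned unchanged; items in
    the chunker.py layout ('text' / 'source') are converted.
    """
    if 'pageContent' in record:
        return record
    return {
        'id': record['id'],
        'pageContent': record['text'],
        # Assumption: chunker.py's 'source' plays the role of 'txtPath'.
        'metadata': {'txtPath': record['source']},
    }


# Example with made-up values in the chunker.py layout
html_item = {'id': 'abc-0', 'text': 'some chunk', 'source': 'site/page.txt'}
converted = normalize_record(html_item)
```

Applying this to each line of train.jsonl before the embed step should let both kinds of items go through the same path, assuming the embed code only reads `id`, `pageContent`, and `metadata`.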

d-neri commented 4 months ago

FYI I got around this for now by using the older code in this commit: https://github.com/Sstobo/Site-Sn33k/commit/5b6121e55d75a79fb7e6c0eb1eabdeecf3deb8d5

Sstobo commented 4 months ago

Thanks for the feedback! I'll get on it as soon as possible.