ChatWithPDF / ai-tools

AI Tooling to bootstrap applications fast

Update PDF Parser to be used #3

Closed ChakshuGautam closed 8 months ago

ChakshuGautam commented 9 months ago
GautamR-Samagra commented 9 months ago

@ksgr5566, the PDF parser work will be tracked here.

GautamR-Samagra commented 9 months ago

Output should look like this: link @ksgr5566

ChakshuGautam commented 9 months ago

@GautamR-Samagra to have hardcoded rules to figure out the relative hierarchy of chunks.
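One possible shape for such a hardcoded rule, as a minimal sketch: it assumes each chunk carries a dotted section index such as "1.2.3" (the thread does not specify the index format), and the helper names `parse_index` and `relation` are hypothetical, not part of the repo.

```python
def parse_index(index: str) -> tuple[int, ...]:
    """Turn a dotted section index such as "1.2.3" into (1, 2, 3)."""
    return tuple(int(part) for part in index.split("."))


def relation(a: str, b: str) -> str:
    """Describe how chunk `a` relates to chunk `b`, using only their section indices."""
    pa, pb = parse_index(a), parse_index(b)
    if pa == pb:
        return "same"
    # `a` is a strict prefix of `b`: `a` sits above `b` in the hierarchy.
    if len(pa) < len(pb) and pb[: len(pa)] == pa:
        return "ancestor"
    # `b` is a strict prefix of `a`: `a` sits below `b`.
    if len(pb) < len(pa) and pa[: len(pb)] == pb:
        return "descendant"
    # Same depth and same parent prefix.
    if len(pa) == len(pb) and pa[:-1] == pb[:-1]:
        return "sibling"
    return "unrelated"
```

Under this assumption, `relation("1.2", "1.2.3")` gives `"ancestor"` and `relation("1.2", "1.3")` gives `"sibling"`. Documents without numbered headings would need a different rule, for example one based on font size or indentation.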

GautamR-Samagra commented 9 months ago

Created CSV: here

Required columns:

ID | Doc ID | contentString | summaryString | summaryEmbedding | text | textEmbedding | page | Tags | semanticVersion(section index) | sectionTitle | sectionString | sectionImages | isTable | isImage | isQuoted | parentString | imageBase64 | titleOfCurrentSection | siblingChunkUp | siblingChunkDown | meta | linkedChunks

The CSV above for Samagra docs covers all columns except: isQuoted, siblingChunkUp, siblingChunkDown, linkedChunks.
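The sibling columns could probably be derived from the rows that already exist. A rough sketch, assuming that rows are in reading order and that "sibling" means the adjacent chunk within the same Doc ID (the thread does not define it); `add_siblings` is a hypothetical helper:

```python
from collections import defaultdict


def add_siblings(rows):
    """Fill siblingChunkUp / siblingChunkDown in place with neighbouring chunk IDs."""
    by_doc = defaultdict(list)
    for row in rows:
        # Group rows per document, preserving their original order.
        by_doc[row["Doc ID"]].append(row)
    for doc_rows in by_doc.values():
        for i, row in enumerate(doc_rows):
            # The first chunk has no upper sibling, the last has no lower one.
            row["siblingChunkUp"] = doc_rows[i - 1]["ID"] if i > 0 else None
            row["siblingChunkDown"] = (
                doc_rows[i + 1]["ID"] if i + 1 < len(doc_rows) else None
            )
    return rows
```

If siblings should instead be limited to chunks under the same parent section, the grouping key would need to include the parent as well as the Doc ID.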

@prtkjakhar @aashutosh-samagra, use the CSV, as it also contains the image base64, which doesn't fit in the Google Sheet.
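A note for anyone loading the CSV with Python's standard `csv` module: its default per-field limit is 131072 characters, which a base64-encoded image can easily exceed. A minimal sketch of reading the file with a raised limit; the function name `read_chunks` and the file path are placeholders, not names from the repo:

```python
import csv


def read_chunks(path):
    """Read the chunk CSV into a list of dicts keyed by column name."""
    # Raise the per-field limit so long imageBase64 values parse.
    # 2**31 - 1 is used instead of sys.maxsize, which can overflow on some platforms.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Usage would look like `rows = read_chunks("chunks.csv")`, after which `rows[0]["imageBase64"]` holds the encoded image string for the first chunk, assuming that column is present.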