DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

How to give HTML code as a string #368

Closed jaswanth-13 closed 3 days ago

jaswanth-13 commented 3 days ago

Question

I want to pass HTML code to the converter as a string, but when I try, I get an error:

import requests
from docling.document_converter import DocumentConverter

doc_converter = DocumentConverter()
# Download the page and keep its HTML as a string
html_content = requests.get('https://en.wikipedia.org/wiki/Cricket').text
# The raw HTML string is passed where a path or URL is expected
docling_doc = doc_converter.convert(html_content)

This raises the following error:

OSError: [Errno 36] File name too long: '<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>India - Wikipedia</title>\n<script>  ..........
dolfim-ibm commented 3 days ago

You have two options:

  1. Simply provide the URL to the convert() method. It will be downloaded for you.
  2. Wrap the content of the file as a binary stream. See https://ds4sd.github.io/docling/usage/#convert-from-binary-pdf-streams.
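
A minimal sketch of both options, assuming the `DocumentStream` class from `docling.datamodel.base_models` described on the linked usage page. The helper names `html_to_stream`, `convert_html_string` and `convert_url`, and the file name `page.html`, are made up for illustration. The `.html` suffix is an assumption intended to help docling recognize the input as HTML.

```python
from io import BytesIO


def html_to_stream(html: str) -> BytesIO:
    """Encode an HTML string as UTF-8 and wrap it in an in-memory binary stream."""
    return BytesIO(html.encode("utf-8"))


def convert_html_string(html: str, name: str = "page.html"):
    """Option 2: convert HTML that is already held in memory as a string."""
    # Imported lazily so html_to_stream() works even without docling installed.
    from docling.datamodel.base_models import DocumentStream
    from docling.document_converter import DocumentConverter

    # `name` is a hypothetical file name; it labels the stream, nothing is read from disk.
    source = DocumentStream(name=name, stream=html_to_stream(html))
    return DocumentConverter().convert(source)


def convert_url(url: str):
    """Option 1: hand the URL to convert() and let docling download the page."""
    from docling.document_converter import DocumentConverter

    return DocumentConverter().convert(url)
```

Under these assumptions, `convert_html_string(requests.get(url).text)` would replace the failing call from the question, since the converter receives a stream and no longer treats the HTML text as a file name.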