Open KMayank29 opened 1 month ago
Hi @KMayank29 - thanks for reporting. To clarify, does partition_html(url=url) or partition_html(text=response.text) give the correct output?
@KMayank29 sounds like an encoding error. What is the type of text when you pass it in?
If it is bytes and the HTML doesn't include an encoding declaration, you'll want to decode it to str before passing it in. Something like:
html_text = html_bytes.decode("utf-8")
You'll need to work out the encoding for your case; it won't necessarily be "utf-8".
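If the encoding is not known up front, one option is to try a few candidates in order. The following is only a minimal sketch: `decode_html`, its `declared_encoding` parameter, and the fallback list are hypothetical names chosen for illustration, not part of unstructured, and the right candidates depend on the page being fetched.

```python
from typing import Optional, Sequence


def decode_html(
    html_bytes: bytes,
    declared_encoding: Optional[str] = None,
    fallbacks: Sequence[str] = ("utf-8", "cp1252"),
) -> str:
    """Decode raw HTML bytes to str, trying each candidate encoding in turn.

    Hypothetical helper: the candidate order is an assumption, not a rule.
    """
    candidates = list(fallbacks)
    if declared_encoding:
        # Prefer whatever the server or document declared, if anything.
        candidates.insert(0, declared_encoding)

    for encoding in candidates:
        try:
            return html_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for these bytes, or an unknown encoding name.
            continue

    # Last resort: keep going, but mark undecodable bytes visibly.
    return html_bytes.decode("utf-8", errors="replace")
```

The resulting str could then be passed as `partition_html(text=...)`. Note that a permissive single-byte fallback can "succeed" on bytes that were really in another encoding, so it is worth checking the decoded text by eye.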
partition_html(url=url) gives the correct output. partition_html(text=response.text) outputs only 2 or 3 sentences and only two types of element.
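To make the difference between the two calls concrete, it may help to count the element types each one returns. This is a rough sketch: `summarize_elements` and `compare_outputs` are hypothetical helpers written for this thread, and they only assume that each call returns a list of objects whose class names identify the element type.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name, e.g. Title or NarrativeText."""
    return Counter(type(element).__name__ for element in elements)


def compare_outputs(from_url, from_text):
    """Return {type_name: (count_from_url, count_from_text)} for both results."""
    url_counts = summarize_elements(from_url)
    text_counts = summarize_elements(from_text)
    names = sorted(set(url_counts) | set(text_counts))
    return {name: (url_counts.get(name, 0), text_counts.get(name, 0)) for name in names}


# Possible usage (needs unstructured, requests, and network access):
#   from unstructured.partition.html import partition_html
#   from_url = partition_html(url=url)
#   from_text = partition_html(text=response.text)
#   for name, counts in compare_outputs(from_url, from_text).items():
#       print(name, counts)
```

Posting this table alongside the report would show exactly which element types go missing when text is passed.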
I pass the str type into partition_html(text=response.text).
import requests
url = "https://www.geeksforgeeks.org/difference-between-compiler-and-interpreter/"
response = requests.get(url)
type(response.text)
# str
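Even when `response.text` is a str, it may have been decoded with the wrong encoding: requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset. One thing that might be worth checking is `response.encoding` against `response.apparent_encoding`. The sketch below is an assumption about the cause, not a confirmed diagnosis, and `text_with_detected_encoding` is a hypothetical helper; it works on any object exposing `content`, `encoding`, and `apparent_encoding`, which a `requests.Response` does.

```python
from types import SimpleNamespace


def text_with_detected_encoding(response) -> str:
    """Decode response.content, preferring the detected encoding when the
    declared one is missing or is the ISO-8859-1 header default."""
    declared = (response.encoding or "").lower()
    if declared in ("", "iso-8859-1"):
        # No trustworthy declaration: use the encoding guessed from the body.
        encoding = response.apparent_encoding or "utf-8"
    else:
        encoding = declared
    return response.content.decode(encoding, errors="replace")


# Stand-in for a requests.Response, so the sketch runs without network access.
fake_response = SimpleNamespace(
    content="naïve café".encode("utf-8"),
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)
print(text_with_detected_encoding(fake_response))
```

With a real response, comparing `partition_html(text=text_with_detected_encoding(response))` against `partition_html(text=response.text)` would show whether decoding plays any part in the difference.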
Any update?
Bug Description
When I pass the url to partition_html, the output is correct. However, when I pass text, it extracts bad content. I believe there is a bug in the source code related to passing text as the argument rather than url. I also tried with the source code and it is working fine. The entire code has been shared below.
Code snippet