Open fortyfourforty opened 4 months ago
I'm not sure what happens here, but this is odd indeed. Note that you can use a web archive so that the errors can still be reproduced later.
In general, duplicated elements can be easily tackled by using the integrated deduplication filters and setting the right threshold.
sorry, I forgot about archive.is. Noted.
I don't think using deduplicate = True is a valid workaround, as there are some pages that do have the exact same text segments more than once on the same page.
@fortyfourforty The integrated deduplication does prevent identical text segments on the same page.
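To make the trade-off discussed above concrete, here is a minimal sketch of exact-match deduplication within a single page. It is not trafilatura's implementation: the function name dedupe_segments and the sample segments are hypothetical and only illustrate why a page that legitimately repeats a segment loses the repeats.

```python
def dedupe_segments(segments):
    """Keep only the first occurrence of each text segment, preserving order."""
    seen = set()
    kept = []
    for segment in segments:
        key = segment.strip()  # compare segments ignoring surrounding whitespace
        if key in seen:
            continue  # an identical segment was already kept, so drop this one
        seen.add(key)
        kept.append(segment)
    return kept


# Hypothetical page in which "Add to cart" is legitimately repeated.
page = [
    "this is sample intro",
    "Add to cart",
    "Product A",
    "Add to cart",
    "Product B",
]

print(dedupe_segments(page))
```

With this kind of filter the second "Add to cart" is removed even though it is real page content, which is the concern raised about using deduplication as a workaround for segments that the extractor itself emits twice.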
hi,
I was setting up a test site and playing with trafilatura, and I found a weird bug.
site URL:
https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/
As this test site is only available for 2 days, I have also attached the simple Gutenberg block code below so that you can replicate the issue.
Command:
The WordPress Gutenberg HTML is below:
It is a very simple extraction, but I find that some elements are extracted twice. The elements below "this is sample intro" appear twice, but not all of them: some of the list elements only show up once.
See the extraction below: