crosstype / node-html-markdown

Fast HTML to markdown converter for NodeJS or the browser

Ignore option not working as expected #49

Open mattkauffman23 opened 1 year ago

mattkauffman23 commented 1 year ago

I'm trying to convert an HTML page to a Markdown document while ignoring any nav tags. I'm calling translate as follows:

import { NodeHtmlMarkdown } from "node-html-markdown";

NodeHtmlMarkdown.translate(
  htmlString, { ignore: ["nav"] }
);

I'm using this page to test with: https://python.langchain.com/en/latest/modules/prompts/output_parsers/examples/retry.html. My expectation is that all the content contained by the nav element would be skipped in the generated markdown, but it's being included.

AnonC0DER commented 1 year ago

Hello, I have the same issue: none of the options work for me. It just ignores them all. Does anyone have a solution?

tnraro commented 8 months ago

nav is defined in defaultBlockElements.

https://github.com/crosstype/node-html-markdown/blob/06a5501523c474f7ae708640d98e4aeabcd67e9b/src/config.ts#L11-L16

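The constructor of NodeHtmlMarkdown is defined as follows: the ignored elements are registered first and the block elements afterwards, so the ignored element translator for nav is overwritten by the block element translator.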

https://github.com/crosstype/node-html-markdown/blob/06a5501523c474f7ae708640d98e4aeabcd67e9b/src/main.ts#L40-L49
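The overwrite described above can be reproduced with a plain Map. This is only a minimal sketch of the mechanism, not the library's actual code: `TranslatorConfig`, `buildTranslators`, `ignoreLast`, and the `blockDefaults` list are hypothetical names made up for illustration, and the real constructor uses its own translator collection classes.

```typescript
// Simplified, hypothetical model of the translator setup (not the library's real classes).
type TranslatorConfig = {
  ignore?: boolean;
  recurse?: boolean;
  surroundingNewlines?: number;
};

function buildTranslators(
  ignoredElements: string[],
  blockElements: string[],
  ignoreLast: boolean
): Map<string, TranslatorConfig> {
  const translators = new Map<string, TranslatorConfig>();

  const addIgnored = () =>
    ignoredElements.forEach(el => translators.set(el, { ignore: true, recurse: false }));
  const addBlocks = () =>
    blockElements.forEach(el => translators.set(el, { surroundingNewlines: 2 }));

  // Map.set replaces any existing entry for the same key,
  // so whichever group is registered last wins for shared tag names.
  if (ignoreLast) {
    addBlocks();
    addIgnored();
  } else {
    addIgnored();
    addBlocks();
  }
  return translators;
}

// Stand-in for defaultBlockElements; assumed here to contain "nav".
const blockDefaults = ["nav", "div"];

const ignoreFirst = buildTranslators(["nav"], blockDefaults, false);
const ignoreAfter = buildTranslators(["nav"], blockDefaults, true);

console.log(ignoreFirst.get("nav")); // block config: the ignore entry was overwritten
console.log(ignoreAfter.get("nav")); // ignore config: registered last, so it survives
```

With the ignored elements registered first, the entry for nav ends up as a block element; registering them last keeps the ignore entry while leaving other block elements such as div untouched.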

Workaround: change the order of translator setup so that the block elements are registered first and the ignored elements last. (For convenience, I recommend using pnpm patch, yarn patch, etc.)

blockElements?.forEach(el => {
  this.translators.set(el, { surroundingNewlines: 2 });
  this.codeBlockTranslators.set(el, { surroundingNewlines: 2 });
});

ignoredElements?.forEach(el => {
  this.translators.set(el, { ignore: true, recurse: false });
  this.codeBlockTranslators.set(el, { ignore: true, recurse: false });
});

jasonbarry commented 6 days ago

I wonder how this test passes, then? @mattkauffman23, have you tried putting the tag name in uppercase, like

{ ignore: ["NAV"] }

mattkauffman23 commented 6 days ago

I don't recall. Haven't worked on the project where I was using this package in quite a while.