adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Preserve horizontal space in code blocks #553

Open mittsommer opened 3 months ago

mittsommer commented 3 months ago

Hello, thanks for your continuous work on trafilatura. Recently, while using trafilatura to extract content that mixes code and text, we noticed that the sanitize function removes all whitespace and tab characters, even inside code blocks, when using the TXT output format. We think the problem is here, where preserve_space=False is the default: https://github.com/adbar/trafilatura/blob/2c9f20296c1c5ce9a23715a07df5b623f3016b65/trafilatura/xml.py#L315C5-L315C51

adbar commented 3 months ago

Do you mean space before the code or space in general? Could you provide a concrete example of a code block?

mittsommer commented 3 months ago

Guten Tag, thank you for your reply. We are working on outputting articles that contain code in Markdown format. Here is an example:

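This generates the demo API service in the current directory. The figure below shows the generated project directory structure: Write the logic in demologic.go under logic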

func (l *DemoLogic) Demo(req *types.Request) (resp *types.Response, err error) {
// todo: add your logic here and delete this line
return &types.Response{
Message: "hello world",
}, nil
}

In this case, all whitespace before each line of code in the code block was removed, which is unexpected and unfriendly for LLM training.
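To make the difference concrete, here is a minimal sketch of the two behaviours. It is not trafilatura's actual implementation: `sanitize_line` and `sanitize_block` are hypothetical helpers, and only the `preserve_space` name is borrowed from the line linked above.

```python
import re


def sanitize_line(line: str, preserve_space: bool = False) -> str:
    """Hypothetical sanitizer, only meant to illustrate the issue."""
    if preserve_space:
        # Keep leading indentation; only drop trailing whitespace.
        return line.rstrip()
    # Collapse every run of whitespace (tabs included) and trim both ends.
    return re.sub(r"\s+", " ", line).strip()


def sanitize_block(text: str, preserve_space: bool = False) -> str:
    """Apply sanitize_line to each line and drop lines that end up empty."""
    lines = (sanitize_line(line, preserve_space) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


code = (
    "func Demo() {\n"
    "\treturn &types.Response{\n"
    '\t\tMessage: "hello world",\n'
    "\t}, nil\n"
    "}"
)

flattened = sanitize_block(code)                       # indentation is lost
preserved = sanitize_block(code, preserve_space=True)  # indentation is kept
```

With the default setting every line starts at column zero, which matches the output shown above; with `preserve_space=True` the tabs survive.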

By the way, here is another (possible) bug: when extracting an inline code block, a redundant '\n' is added after it. The current result is:

1.2. Implement the WebMvcConfigurer

interface and register the interceptor

which is supposed to be:

1.2. Implement the WebMvcConfigurer interface and register the interceptor
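The expected behaviour for inline code could be sketched as follows. This is an assumption-laden illustration, not trafilatura's serializer: `to_text`, the `INLINE_TAGS` set, and the sample markup (an English sentence modeled on the example above) are all hypothetical. The idea is that a line break is emitted only after block-level children, never after an inline element.

```python
import xml.etree.ElementTree as ET

# Assumption: tags treated as inline; a real extractor may use other names.
INLINE_TAGS = {"code", "hi"}


def to_text(elem: ET.Element) -> str:
    """Hypothetical text serializer that keeps inline elements on one line."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(to_text(child))
        if child.tag not in INLINE_TAGS:
            # Only block-level children end with a line break.
            parts.append("\n")
        # The tail is the text that follows the child inside the parent.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(
    "<p>1.2. Implement the <code>WebMvcConfigurer</code> interface "
    "and register the interceptor</p>"
)
line = to_text(paragraph)
```

Here `line` stays on a single line because the `code` element is inline and its tail text is appended directly after it.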

thank you

adbar commented 3 months ago

Yes, spacing is not necessarily preserved in code blocks; this can be improved.