dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

Could ChatGPT be used to write site specific parsers? #1128

Open dteviot opened 7 months ago

dteviot commented 7 months ago

ChatGPT (and Google's Bard, and similar) can apparently write code for tasks. So, could it be persuaded to write parsers for specific web sites?

gamebeaker commented 7 months ago

@dteviot My attempt with Bard failed.

dteviot commented 7 months ago

@gamebeaker Any notes? What did you try? What was the result? It's hard to learn much from just "failed".

gamebeaker commented 7 months ago

Here is a link to what I tried: https://g.co/bard/share/fb1b5954cece

dteviot commented 7 months ago

Hmm....

Thanks.

Kiradien commented 7 months ago

AIs may have better luck with a more generic parser, where less actual code needs to be written. I have no idea if it will actually be helpful, but I've used one a few times on and off. With it, the AI would theoretically just pass the parameters for instantiation instead of writing code.

To be clear, the code needs cleanup and doesn't currently handle paging. It is possible to implement paging as additional parameters; I lost that changeset a while back and haven't used this in a while. GenericParser.txt

That said, if this generic parser were actually perfect, I'd have submitted it ages ago.
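To illustrate the parameter-driven idea above, here is a minimal sketch (the function and config names are hypothetical, not WebToEpub's actual API): the "site parser" is reduced to a config object of CSS selectors, consumed by one generic extractor, so an LLM would only have to emit the config rather than code.

```javascript
// Sketch of a parameter-driven generic parser.
// An LLM (or a human) supplies only the selector config; the extraction
// logic is shared. Names here are illustrative, not WebToEpub's API.
function makeGenericParser(config) {
    return {
        // Each extractor looks up its selector and returns trimmed text,
        // or null if the selector matches nothing on the page.
        extractTitle: dom =>
            dom.querySelector(config.titleSelector)?.textContent.trim() ?? null,
        extractAuthor: dom =>
            dom.querySelector(config.authorSelector)?.textContent.trim() ?? null,
    };
}

// Hypothetical per-site config an AI would be asked to produce:
const exampleConfig = {
    titleSelector: "h1",
    authorSelector: ".author",
};
```

Paging could then be handled the same way, as extra selector parameters (e.g. a "next page" link selector), which matches the approach described above.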

dteviot commented 7 months ago

Trying to be more specific didn't work.

Request:

Provide javascript that, given the URL of a story's Table of Contents, can extract the author, for stories on the site https://kakuyomu.jp. A typical Table of Contents page is https://kakuyomu.jp/works/1177354054894027232

Response:

As a language model, I'm not able to assist you with that.

Request:

Page https://kakuyomu.jp/works/1177354054894027232 is the table of contents for a web novel. How can I find the Author?

Response:

Unfortunately, I couldn't find the author information on that page.

(Note: the author actually appears in 3 places on that page.)

Mathnerd314 commented 6 months ago

You have to remember that language models process text, not URLs. The support for external requests is rather hacky: generally it works by downloading the page and including its text for the model to read (in a special format). I tried manually including the HTML in ChatGPT 3.5 and Bard, and it is too long to include in the prompt. In Bard I also tried including a placeholder like <full html of https://kakuyomu.jp/works/1177354054894027232>; that does get it to download the page, but I think the downloader preprocesses the page to plain text, because the CSS selectors it output (.widget-user, .user-name) bore no relationship to the HTML of the page.

So what is necessary is a model that supports input sizes of 360KB (the page's HTML size). CodeLlama in theory supports 100K-token context windows, which might be enough (?), but it isn't available as a convenient web service so I haven't tried it. Also I found some posts like "100k tokens is a meme". OTOH, Claude 2 is in open beta, supports 200k tokens, and you can just attach files directly, so I went with that. Even with the huge limit, the HTML file was ~55% too large; I think part of that is the free token limit being lower than the paid one, but also HTML files are just huge. I cut it down by removing the JSON slug at the end and trimming out some CSS files, JS files, and SVG paths.
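The trimming described above can be sketched as a small preprocessing step run before prompting a model. This is a rough illustration of the idea, not the exact script used: regex-based stripping is a heuristic, not a real HTML parser, and would mangle pathological markup.

```javascript
// Rough sketch: shrink a fetched page before feeding it to an LLM by
// dropping content the model doesn't need for selector-writing:
// scripts, styles, inline SVGs, comments, and runs of whitespace.
function trimHtmlForPrompt(html) {
    return html
        .replace(/<script\b[\s\S]*?<\/script>/gi, '')  // JS files / inline JS
        .replace(/<style\b[\s\S]*?<\/style>/gi, '')    // inline CSS
        .replace(/<svg\b[\s\S]*?<\/svg>/gi, '')        // SVG path data
        .replace(/<!--[\s\S]*?-->/g, '')               // comments
        .replace(/[ \t]{2,}/g, ' ');                   // collapse indentation
}
```

Even with all of that stripped, a real page can still overflow a context window, which is why the JSON slug at the end of the page also had to go.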

Now the question: did it work?

Request:

I have provided the HTML code of a sample story's Table of Contents page on the site https://kakuyomu.jp/. Please provide JavaScript code that can extract the author from similarly-structured pages on the site https://kakuyomu.jp/. I would recommend finding the correct element using a CSS selector or XPath query but you may use whatever logic is necessary. Write it as the implementation of an extractAuthor(dom) JavaScript function, where dom is the result of (await HttpClient.wrapFetch(url)).responseXML. The author's name for this sample page is 羽田宇佐.

First I tried it without author's name, it gave

  const titleElement = dom.querySelector('h1.Heading_heading__lQ85n');
  const authorElement = titleElement.querySelector('.WorkTitle_workLabelAuthor__Kxy5E');

The title selector is correct, but the author selector matches the list of authors in related works (not the author of this work), and when scoped to the title element it matches nothing.

Then I added the author's name. After several attempts that hit capacity limits, I eventually got through, and it suggested this code:

  const authorLink = dom.querySelector('a[href="/users/hanedausa"]');

This works for the page, but obviously not in general.
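For what it's worth, the over-specific selector above could be generalized by hand: the author link appears to point at a `/users/` profile, so matching on that prefix (e.g. the CSS selector `a[href^="/users/"]`) rather than a specific username would be the obvious fix. A regex-based sketch of the same idea, runnable without a DOM (the assumption that the first `/users/` link on the page is the author is exactly the kind of site-specific guess a human parser author would have to verify):

```javascript
// Hypothetical generalization of the model's output: instead of hardcoding
// a username, find the first anchor whose href starts with /users/ and
// return its text. Assumes (unverified) that this link is the work's author.
function extractAuthorGeneric(html) {
    const m = html.match(/<a[^>]*href="\/users\/[^"]+"[^>]*>([^<]+)<\/a>/);
    return m ? m[1] : null;
}
```

Of course, getting the model itself to make that abstraction step (username → prefix) is the part that didn't happen.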

Concluding, it does at least seem to be reading the page, but it has relatively little understanding of the DOM. None of the selectors involved parent-child relationships or anything like that. The LLMs are clearly struggling to parse the raw HTML, both in capacity limits and in understanding.

Short answer: no, it doesn't work, the model got confused.

I would say that, to work properly, you need a multi-step structured process that doesn't involve feeding LLMs raw HTML: render a picture of the page, run an image model over it to identify elements such as title and author, translate the coordinates back into DOM elements, and then use a second model (or just heuristics) to guess CSS selectors from the concrete DOM path.
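The last step of that pipeline, going from a concrete DOM element back to a reusable CSS selector, is the most mechanical part. A minimal sketch of a selector-path builder (my own illustration, not code from this pipeline; real tools also disambiguate with `:nth-child` and check uniqueness):

```javascript
// Hypothetical helper for the pipeline's final step: given a concrete DOM
// element (e.g. found via document.elementFromPoint from image-model
// coordinates), walk up parentNode building a CSS selector path.
// Stops early at the nearest ancestor with an id, since ids anchor a path.
function cssPath(el) {
    const parts = [];
    while (el && el.tagName) {
        let part = el.tagName.toLowerCase();
        if (el.id) {
            parts.unshift(part + '#' + el.id);
            break; // an id is unique enough to anchor the selector
        }
        if (el.className) {
            part += '.' + el.className.split(' ')[0]; // first class only
        }
        parts.unshift(part);
        el = el.parentNode;
    }
    return parts.join(' > ');
}
```

A heuristic layer on top would then generalize the path (drop volatile hashed class names like `Heading_heading__lQ85n`, prefer stable attributes) before committing it to a parser.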

dteviot commented 6 months ago

@Mathnerd314 Thanks for that. (And that's a lot more effort than I was willing to put into it.) To be honest, I was not expecting this to work.
But, given the hype about LLMs, I thought I'd give it a go.

That said, when I tried some other stuff, I was somewhat surprised by the response to "How can I convert a Web Novel to an epub?" I got three methods back; the third was to use WebToEpub.

Mathnerd314 commented 6 months ago

Well, the annoying thing is that it "almost" works. It generates code, it generates CSS selectors in that code, and the CSS selectors even match some things in the HTML. They are just the wrong things. That suggests that if you tripled the size of the model and fine-tuned it on some examples, it might actually work. So really it is a hardware problem. I would say, if you don't want to try the image recognition route, just shelve it and try again in 3-4 years, when every new computer comes with a dedicated AI chip and the models have gotten better.

Mathnerd314 commented 6 months ago

But also, there are DOM-aware approaches:

They aren't off-the-shelf though (yet) so it is probably overkill compared to the heuristics you have now and the simple expedient of manually specifying the CSS selectors, until someone releases a library that you can just start using.