AngleSharp / AngleSharp.Js

:angel: Extends AngleSharp with a .NET-based JavaScript engine.
https://anglesharp.github.io
MIT License
103 stars 22 forks source link

Loading jQuery 2.2.4 causes deadlock #63

Closed doominator42 closed 4 years ago

doominator42 commented 4 years ago

Bug Report

Prerequisites

Description

As the title says, loading jQuery 2.2.4 causes a deadlock. I'm using the following source: https://cdnjs.cloudflare.com/ajax/libs/jquery/2.2.4/jquery.js

Steps to Reproduce

var config = Configuration.Default
    .WithRequester(new HttpClientRequester())
    .WithDefaultLoader(new LoaderOptions()
    {
        IsResourceLoadingEnabled = true
    })
    .WithJs()
    .WithEventLoop());

var document = await BrowsingContext.New(config).OpenAsync(r => r.Content(@"
<!DOCTYPE html>
<html>
    <head>
        <script src=""https://cdnjs.cloudflare.com/ajax/libs/jquery/2.2.4/jquery.js""></script>
    </head>
    <body>
        The body
    </body>
</html>"));

After investigation, it turns out that the issue comes from setting innerHTML while the document is loading. Here is a simple repro:

var config = Configuration.Default
    .WithRequester(new HttpClientRequester())
    .WithDefaultLoader(new LoaderOptions()
    {
        IsResourceLoadingEnabled = true
    })
    .WithJs()
    .WithEventLoop());

var document = await BrowsingContext.New(config).OpenAsync(r => r.Content(@"
<!DOCTYPE html>
<html>
    <head>
        <script>
            var div = document.createElement('div');
            document.documentElement.appendChild(div).innerHTML = '<span></span>';
        </script>
    </head>
    <body>
        The body
    </body>
</html>"));

Expected behavior: OpenAsync should return the document with the script loaded

Actual behavior: OpenAsync is blocked and does not finish ever

Environment details: .Net Core 3.1

Possible Solution

So, after a bunch of debugging, I found the deadlock happens in AngleSharp.Html.Parser.Dom.HtmlDomBuilder.cs:L141.

After parsing the last html part of the fragment, the tokenizer returns an EndOfFile token and is passed through the following methods:

The End method then sets _waiting = _document.FinishLoadingAsync(); which is causing the deadlock. So, I think the End method should not be called when a fragment is being parsed, but I'm not quite sure where exactly to put the condition. Also, I don't know if there are other cases appart of document fragments which could cause this issue.

Here's what I changed and now jQuery loads correctly. HtmlDomBuilder.cs:L1511

case HtmlTokenType.EndOfFile:
{
+   if (IsFragmentCase)
+       return;

    CheckBodyOnClosing(token);

    if (_templateModes.Count != 0)
    {
        InTemplate(token);
    }
    else
    {
        End();
    }

    return;
}

Is my thinking correct? Also the changes are to be done in the AngleSharp repo, but I'm posting here because I tought the issue is more related to scripting. Should I open an issue in the AngleSharp repo as well or just a PR is enough?

Awesome project btw :+1:

FlorianRappl commented 4 years ago

Awesome investigation 🚀 !

I think indeed that something like that fix would be working, but I think we may rather need to look at the FinishLoadingAsync. In the fragment case there shouldn't be anything open so something is definitely fishy there.

I also fully agree that this is rather an AngleSharp issue, but I appreciate that you posted it here - since scripting is the only victim here.

I will try to spend some time tomorrow thinking about the potential root cause / a potential solution and I would post it here. I would then love to get your input or assertion / evaluation for this. Let's to fix this and make a PR for AngleSharp 0.14! :beers:

doominator42 commented 4 years ago

Thanks for your input!

You are absolutly right about the FinishLoadingAsync. At first, I tought it was using the same instance of HtmlDomBuilder for the fragment and the main document, so that the End call was closing the main document.

I looked further into FinishLoadingAsync. What happens is that it's queueing the DOMContentLoaded event and returning a task which would not be completed until the event is invoked on the event loop. The event then never fires because it's waiting for the current event to finish. Here is where the event is queued What causes FinishLoadingAsync to wait for the event to complete

I'm not sure how to fix this one. Maybe the event could be fired synchronously (direct call, without using the event loop)? I would be curious to see how actual browsers implemented this.

FlorianRappl commented 4 years ago

I'm not sure how to fix this one. Maybe the event could be fired synchronously (direct call, without using the event loop)? I would be curious to see how actual browsers implemented this.

Same here. I thought about over it. I tend to actually use the early escape of the end of file case as shown in the OP.

Haven't reached a conclusion yet (not sure about any side effects).

doominator42 commented 4 years ago

(not sure about any side effects).

I think that FinishLoadingAsync still needs to be called to wait for resources (attached references) to load.

I had some spare time and looked at chromium source and they fire the event synchronously, see source. Ironically, there is a comment right above that says it should be asynchronous but the linked bug ticket has been closed with status WontFix.

I then looked at firefox source to compare and they fire the event asynchronously. Also, I'm not 100% sure (c++ is not really my thing), but I think it is not awaited.

So I can see 3 possible solutions:

IMO, the first would be the best.

What do you think @FlorianRappl ?

FlorianRappl commented 4 years ago

Yeah I think it is either

Let's go with the synchronous one. That way we delay / block with the resources, but we don't run into a (hidden) deadlock. I think it should be best.

Should I make the PR in AngleSharp or do you want to make it (I'll approve)? It would go into the 0.14 pre version, which we could then test and (if it works) I would make sure 0.14 is released ~start of next week.

doominator42 commented 4 years ago

:+1: I'll submit the PR in the next days.

doominator42 commented 4 years ago

I saw there's already some tests for jQuery, but they are not using the event loop. That causes the events to be fired synchronously, exactly as my changes does, so that's why it was passing. I ran the tests with .WithEventLoop() and it passes when applying my changes.

Should I send another PR for the tests? Because the changes are in the other repo and tests would be in this repo, not sure if it would break things.

FlorianRappl commented 4 years ago

No its fine - send the PR for this repo, too. Since this is a AngleSharp.Js 0.14 pre-release you can reference the AngleSharp 0.14 pre-release, too.

Thanks - much appreciated! :beers:

FlorianRappl commented 4 years ago

Available in develop.