KSP-CKAN / CKAN

The Comprehensive Kerbal Archive Network
https://forum.kerbalspaceprogram.com/index.php?/topic/197082-*
Other
1.99k stars 348 forks source link

HttpClient times out for some users while trying to get etags #3950

Closed dandoesgithub closed 11 months ago

dandoesgithub commented 11 months ago

Is there an existing issue for this?

Operating System

Windows 10

CKAN Version

1.34.0

Games

KSP 1

Game Version

KSP 1.12.5

Did you make any manual changes to your game folder (i.e., not via CKAN)?

Yes, I deleted the mods inside of GameData

Describe the bug

CKAN gets stuck during "Refreshing game version data", then after a while errors. This only happens with one instance. The solution for this bug for now is searching a mod in the search bar, then refreshing

I expected it to load normally image

Steps to reproduce

Step 1: Load an instance Step 2: Refresh Step 3: Get lucky (or unlucky) and get the error

Relevant log output

No response

HebaruSan commented 11 months ago

Hi @dandoesgithub, thanks for the report.

This only happens with one instance.

To fix this we will need to know what is special about that instance. Have you perhaps added some custom repositories to it?

dandoesgithub commented 11 months ago

No, I haven't, it's the default repository

HebaruSan commented 11 months ago

OK, is there anything else about that instance that you can think of that could explain why this doesn't happen for me or anyone else up to this point?

dandoesgithub commented 11 months ago

Well, I deleted all the mods in the GameData folder (CKAN doesn't know that), but I don't think that's what's causing the problem. I'm not sure why it could be happening, as it was working in 1.33.2. I'll try forgetting the instance then reloading it in CKAN

dandoesgithub commented 11 months ago

I could take a look at the CKAN files in the particular instance

JonnyOThan commented 11 months ago

@dandoesgithub It might help to zip up your CKAN folder from that instance and post it here.

dandoesgithub commented 11 months ago

slight problem: i have deleted that instance. craaaap. i'll try to recover it with a program but i don't think i'll succeed

HebaruSan commented 11 months ago

Oh, in that case we won't be able to investigate this. Let us know if you do recover the instance and we can re-open it.

dandoesgithub commented 11 months ago

wait: the new instance that i made a few days ago (to replace the broken one) has the same problem. i can send that one instead.

dandoesgithub commented 11 months ago

no, every other instance works except for this one. either it actually is my PC, or it's the instance

anyway, here's the CKAN file CKAN.zip

HebaruSan commented 11 months ago

Thanks, that ZIP looks fine to me. Maybe @JonnyOThan had something particular in mind.

If it's having the error during "Refreshing game version data", that gets stored in another place. Can you send us %LOCALAPPDATA%\CKAN as well? You can exclude the downloads subfolder if it's huge.

Are you using the same ckan.exe file for all your instances, or are you copying it into each instance? Usually that doesn't matter, but I'm trying to think of other things that might differentiate instances.

dandoesgithub commented 11 months ago

CKAN.zip, i am using the same ckan.exe for all my instances.

dandoesgithub commented 11 months ago

should i try deleting the ksp-default repo then adding it again?

HebaruSan commented 11 months ago

Thanks, that looks all-in-order as well. I would not expect deleting and re-adding the repo to help, but I wouldn't discourage you from trying it if you think otherwise.

The next thing to try is probably capturing some logs so we can get more information about what CKAN is doing when this problem happens. If you unzip this log4net.xml file into the directory where your ckan.exe file lives and then run CKAN, it should load the settings automatically and generate a CKAN.log file that you can share here.

log4net.xml.zip

<?xml version="1.0" encoding="utf-8" ?>
<log4net>
  <appender name="CkanFile" type="log4net.Appender.RollingFileAppender">
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Finding latest available" />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Installable: " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Already enlisted" />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Available: " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Adding new available module " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Adding available module " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Adding to available module: " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Copying " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Writing file " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Making directory " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Making parent directory " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Not registering " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Got owners: " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Module parsed: " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Reading CKAN data from " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Skipping archive entry " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <regexToMatch value="Checking compat of" />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Got contents" />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Not removing directory " />
        <acceptOnMatch value="false" />
    </filter>
    <filter type="log4net.Filter.StringMatchFilter">
        <stringToMatch value="Removing " />
        <acceptOnMatch value="false" />
    </filter>

    <file value="./CKAN.log" />
    <appendToFile value="false" />
    <rollingStyle value="Once" />
    <maxSizeRollBackups value="10" />
    <lockingModel type="log4net.Appender.FileAppender+MinimalLock" />

    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="%6timestamp %-5level %logger  %message%newline" />
    </layout>
  </appender>
  <!-- Log everything by default -->
  <root>
    <level value="DEBUG" />
    <appender-ref ref="CkanFile" />
  </root>
  <!-- Turn down some of the noisier loggers -->
  <logger name="CKAN.CompatibilitySorter">
      <level value="INFO" />
  </logger>
  <logger name="CKAN.NetFileCache">
      <level value="INFO" />
  </logger>
  <logger name="CKAN.CkanModule">
      <level value="INFO" />
  </logger>
</log4net>
Falki-git commented 11 months ago

Just confirming that I'm getting this problem as well. But sometimes, now always. I'll try with the logging @HebaruSan suggested.

image

Falki-git commented 11 months ago

Here is the log after one of these hangs. CKAN.log

HebaruSan commented 11 months ago

Thanks, @Falki-git. That's very strange; that log file shows CKAN completing and exiting normally. If it's the file that you meant to send, then that may mean we don't have logging set up for wherever this problem is happening.

"One or more errors occurred" isn't even our text; it's fromAggregateException. So some problem is happening and then the framework is obscuring it with that class. The documentation says it's used extensively by PLINQ, so this is probably related to the recent changes to parallelize various things for performance. (Or the root cause could be unrelated, but now happening in parallel and therefore obscured by PLINQ.)

Here's a test build with a first attempt to get the original problem from the AggregateException based on its documentation; please try it if you have time and are willing to trust that I'm not sending you malware:

(I did build that on Ubuntu, so the risk of malware should be very low anyway.)

Falki-git commented 11 months ago

@HebaruSan, I tried it and I get the same behavior. I can refresh several times without a problem and then it hangs on the next refresh. Hang lasts for about a minutes and then the "task canceled" is thrown.

Here are the logs from this attempt.

CKAN_logs.zip

And just a quick video of what's happening.

https://github.com/KSP-CKAN/CKAN/assets/72734856/1efcbdb2-a378-4674-b56d-a3c346d01f59

HebaruSan commented 11 months ago

Yeah, you're supposed to get the same behavior, all I added was handling of the AggregateException. That's why it now says "A task was canceled" instead of "One or more errors occurred".

The log has basically the same info. I.e., almost none:

107407 ERROR CKAN.GUI.Main  A task was canceled.
System.Threading.Tasks.TaskCanceledException: A task was canceled.

So, that's... great.

dandoesgithub commented 11 months ago

Deleting and re-adding the repo worked! apparently the repo was the problem!

HebaruSan commented 11 months ago

One more attempt to get useful info out of this exception:

@Falki-git or @dandoesgithub or anyone else who can reproduce this, please try that build and let us know what it says.

dandoesgithub commented 11 months ago

oh, the issue is still happening on that build, deleting the default repo and re-adding it worked again

HebaruSan commented 11 months ago

Deleting the repo and re-adding it doesn't actually change anything, though. I think this is a race condition that has turned out differently after you've done that by coincidence.

Some support pages are saying HttpClient can throw this as a timeout, which on the surface makes sense since we do now use that to get the etags, but that should take over a minute and a half to time out, and that video is less than 15 seconds:

image

HebaruSan commented 11 months ago

Also @dandoesgithub the point of that build was to gather info. Can you please share a screenshot or the logs?

dandoesgithub commented 11 months ago

where are the logs located?

HebaruSan commented 11 months ago

I should say explicitly: The test builds that I have been sharing are not attempting to fix the problem, because we do not know what the problem is. They are solely trying to get enough information to identify what the problem is. Normally I would do this myself on my own system, but since the problem does not happen for me, I have to go through whomever can reproduce it.

@dandoesgithub see above re: logs. https://github.com/KSP-CKAN/CKAN/issues/3950#issuecomment-1859219946

dandoesgithub commented 11 months ago

CKAN.txt

HebaruSan commented 11 months ago

Thanks; that looks just like Falki's log, so either you're still using the previous build, or TaskCanceledException doesn't provide any InnerExceptions.

147790 ERROR CKAN.GUI.Main  A task was canceled.
System.Threading.Tasks.TaskCanceledException: A task was canceled.

Side note, the 100-second default timeout seems to be confirmed by inspecting an HttpClient object:

csharp> LoadAssembly("System.Net.Http");
csharp> var hc = new System.Net.Http.HttpClient();
csharp> hc.Timeout;
00:01:40
Falki-git commented 11 months ago

@HebaruSan, here is the log for the newest ckan.exe you sent: CKAN_log2.zip

Regarding the video, I should have been explicit also, I cut it down to 15 seconds so you don't need to look at a static screen, but from clicking Refresh to the timeout it takes a minute or so (likely 100 secs as you said).

P.S. I didn't try to delete the repo yet

HebaruSan commented 11 months ago

@dandoesgithub and @Falki-git, here's a new test build for you to try:

This one does new things: first it prints the value of HttpClient.Timeout, and then it catches, logs, and ignores any exceptions it throws while trying to get the etags. This might help with the problem you're having, but whether it does or not, it's important that we get the log file so we can determine the right long term fix for the real builds.

Regarding the video, I should have been explicit also, I cut it down to 15 seconds so you don't need to look at a static screen, but from clicking Refresh to the timeout it takes a minute or so (likely 100 secs as you said).

Thanks, that makes a lot more sense! That makes me a lot more confident concluding that this is in fact HttpClient timing out. I built and typed the above before you posted this, but I think it's still worth capturing those logs.

Falki-git commented 11 months ago

And here's the log. It's more verbose this time.

CKAN_logs3.zip

HebaruSan commented 11 months ago

And here's the log.

Thanks! Yup, HttpClient.SendAsync is definitely throwing.

 31572 INFO  CKAN.Net  HttpClient.Timeout is 00:01:40
131575 WARN  CKAN.Net  Failed to get etag for https://github.com/KSP-CKAN/KSP2-CKAN-meta/archive/main.tar.gz
System.AggregateException: One or more errors occurred. ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at CKAN.Net.CurrentETag(Uri url)
---> (Inner Exception #0) System.Threading.Tasks.TaskCanceledException: A task was canceled.<---

It's more verbose this time.

But it looks like it continued and finished the update this time?

Falki-git commented 11 months ago

Yes, it completed the update

HebaruSan commented 11 months ago

Alright, based on that and the fact that the previous implementation seemingly never had this problem, I have to consider the possibility that HttpClient just doesn't work very well (despite the fact that it has a compiler warning recommending its use). To provide a point of comparison, here's a fresh build with the old pre-HttpClient implementation of CurrentETag:

If that build works for you, @Falki-git and @dandoesgithub, without a 100-second delay, then we'll probably need to roll back the switch to HttpClient. It would be nicer to keep up with the latest and greatest library code, but if HttpClient is just a hopelessly buggy mess, then we can't reasonably use it.

dandoesgithub commented 11 months ago

@dandoesgithub and @Falki-git, here's a new test build for you to try:

* (file removed)

This one does new things: first it prints the value of HttpClient.Timeout, and then it catches, logs, and ignores any exceptions it throws while trying to get the etags. This might help with the problem you're having, but whether it does or not, it's important that we get the log file so we can determine the right long term fix for the real builds.

Regarding the video, I should have been explicit also, I cut it down to 15 seconds so you don't need to look at a static screen, but from clicking Refresh to the timeout it takes a minute or so (likely 100 secs as you said).

Thanks, that makes a lot more sense! That makes me a lot more confident concluding that this is in fact HttpClient timing out. I built and typed the above before you posted this, but I think it's still worth capturing those logs.

sorry for the late reply, just noticed you sent a new ckan.exe. i'll send the logs of the one you sent before, then the logs of the one rn CKAN.log

dandoesgithub commented 11 months ago

well, the build that you sent just now definitely works when refreshing my instance. CKAN.log

HebaruSan commented 11 months ago

Thanks, that's one vote solidly in favor of reverting to WebRequest, then. If Falki sees the same thing, HttpClient will be escorted from the premises.

Falki-git commented 11 months ago

Yes, that's a lot better. I cannot reproduce the timeout anymore. Tried multiple instances, refreshed many times, it works 100% now. CKAN_logs4.zip

HebaruSan commented 11 months ago

Hi @Dominiquini, this issue is not about a back and forth between me and you, it is about HttpClient throwing a TaskCanceledException after a bogus timeout that gets wrapped in a fromAggregateException and reported as "One or more errors occurred". I have locked the conversation and deleted the off-topic comments. Thanks for filing the duplicate bug report.