SchneiderInfosystems / FB4D

Open Source Library for connecting Firebase by Delphi VCL/FMX
https://www.schneider-infosys.ch/DelphiBlog/de/#FB4D
Apache License 2.0
177 stars 51 forks source link

Access violation in Delphi 10.4 asynchonous calls in THttpClient and a possible way to avoid them. #84

Closed macc2010 closed 3 years ago

macc2010 commented 3 years ago

Describe the bug When I run the DemoFB4D.dpr and click the button "Start listener" in a collection, after a undefined period of time, there is an error like Listener error Exception in listener: Invalid stream read count: 207 - 596 (C:....\Source\FB4D.FireStore.Listener.pas, line 1017. After this error, I got access violations in the THTTPClient.ExecuteHTTPInternal of System.net.HttpClient.pas, in line 1262 when is called LResponse.DoReadData(LResponse.FStream), LResponse.FStream is nil and in TWinHTTPResponse.DoReadData and access violation occurs. I have reported this problem in Embarcadero Jira : https://quality.embarcadero.com/browse/RSP-33810

To Reproduce Steps to reproduce the behavior:

  1. Go to DemoFB4D.dpr
  2. Click on button Start listener
  3. Wait until the access violation occurs ( usually, I leave running the application all night ).

Desktop (please complete the following information):

Possible solution: I have thought that as the problem is in the System.Net.HttpClient.pas that is trying to use a destroyed object, it looks as if the THTTPClient class needs the fStream created all time. If it is as I say, a possible solution could be to create the fStream at TFSListenerThread.Create and destroy fStream in TFSListenerThread.Destroy and replace the sites where fStream is created and destroyed with fStream.Size := 0 and fStream.Position := 0. What's your opinion?.

Thank you and best regards.

SchneiderInfosystems commented 3 years ago

Do you use the newest version? In the current version, line 1017 is within TFSListenerThread.StopListener at {$IFDEF ParserLog}?

I saw this kind of exception within TFSListenerThread.OnRecData at line 1066 within ss.CopyFrom(fStream, ReadCount - fReadPos); I assume we see here a sporadic race condition between writing and reading of the stream. I don't know how we can save access to the stream by using, e.g., a critical section. I am wondering what the position of the Embarcadero THTTPClient developer is? Therefore I have added a retry logic that retries the readout of the stream up two times. Do you get a reraise situation in line 1075 after two retries? I don't believe that using only one stream is more stable because if the race condition destroys the stream, my current solution is probably more stable. But I'm open to learning from more insights. BTW, I have voted for your RSP-33810.

@macc2010: Can you connect me at LinkedIn so we can discuss the details of your report to Embarcadero?

macc2010 commented 3 years ago

Sorry, I sent in the comment the log of the old sources, the new ones are :

16:04:08: Token renew passed 17:05:55: Token renew passed 17:56:38: Listener error Exception in listener: Invalid stream read count: 207 - 596 (C:\Software\Delphi\Lib\FirebaseForDelphi\Source\FB4D.FireStore.Listener.pas, line 1070) 17:56:38: Listener error Exception in event parser: Invalid packet received: [[225,["noop"]12 17:56:38: Listener error Exception in event parser: Invalid packet received: [ 18:08:55: Token renew passed 18:19:17: Listener error Exception in listener: Invalid stream read count: 202 - 596 (C:\Software\Delphi\Lib\FirebaseForDelphi\Source\FB4D.FireStore.Listener.pas, line 1070) 18:19:17: Listener error Exception in event parser: Invalid packet received: [[49,["noop"]]1 18:19:17: Listener error Exception in listener: Stream read error

I am using the lastest version of FB4D, this morning I compare it with my local copy.

This morning, I did the changes that I mentioned in my previous comment ( create fStream at the beginning of the procedure TFSListenerThread.Execute, before "while not CheckTerminated and not fStopWaiting do", added FreeAndNil(fStream) to the TFSListenerThread.Destroy, and replaced all FreeAndNil(fStream) with "fStream.Size := 0; fStream.Position := 0" ), and by now I haven't seen more access violation in System.net.HttpClient.pas.

But I have seen one access violation here :

procedure TFSListenerThread.OnEndListenerGet(const ASyncResult: IAsyncResult); var Resp: IHTTPResponse; begin try if not ASyncResult.GetIsCancelled then begin try Resp := fClient.EndAsyncHTTP(ASyncResult); // <---- fClient is nil in some circumstance ..........................

In some circumstance, TFSListenerThread.OnEndListenerGet is called twice, and the second one, fClient is nil.

It is difficult to debug, because the problem need time to raise the error.

I will try to sign up in LinkedIn and connect with you, my english spoken is not good, I do not know if LinkEdiIn is by videoconference, or by chat, if you don't mind, I prefer to connect by chat.

Thank you.

macc2010 commented 3 years ago

This is the debugger stopped before the access violation :

fclientnil

macc2010 commented 3 years ago

I have not mentioned in previous comments that I am using Delphi 10.4 update 2, but I remember the access violations also in Delphi 10.3.3 ( due to fclient = nil ).

SchneiderInfosystems commented 3 years ago

I have never seen this situation. To overcome this situation I have added a simple check. Can you check, what happens after the logline "FSListenerThread.OnEndListenerGet: aborted HTTP"?

macc2010 commented 3 years ago

To test for access violations, all that I do is to leave DemoFB4D.dpr running with the Delphi 10.4.2 IDE overnight (having configured Notify on exceptions unchecked) and when I come back in the morning I see the access violation.

After some modifications that I made yesterday in FB4D.FireStore.Listener.pas, I leave again DemoFB4D.dpr running all the night, and when I have returned this morning, I have not encountered any access violations.

Just at the moment of clicking the "Stop listener" button and exiting the program, an access violation occurred trying to access fStream with value nil. I have added code to control this situation and I think that now the code cannot raise access violations.

I give you the source that I have modified in case you want to see my possible solution. It basically consists of not creating instances of fClient or fStream and having them always available to avoid the bugs in Delphi standard libraries.

FB4D.FireStore.Listener.zip

macc2010 commented 3 years ago

I have seen that in THttpClient there must be another bug, relationed with certificates and with the destroy of THttpClient and the async way that uses ( I had another access violations on THTTPClient.DoNeedClientCertificate, called by THTTPClient.ExecuteHTTPInternal ). Because of that, my possible solution would be to have also the fClient created before the execution of the thread and not to destroy it until the code have exited from the while that is in the execute method of the thread.

macc2010 commented 3 years ago

Now, I have left running the DemoFB4D.dpr for a long time within the Delphi 10.4 IDE with the IDE Events window opened to see the OutputDebugString messages.

I am using this modified FB4D.FireStore.Listener.pas that has your last changes and the mine to avoid the access violations :

FB4D.FireStore.Listener.zip

I will send you the results incluiding the Event messages.

Thank you.

macc2010 commented 3 years ago

As I didn't want to test first my modifications, I have downloaded the lastest version of FB4D project and I have rebuild the project and I have left DemoFB4D.dpr running listening for collection changes and after one hour, I have seen the access violation in the same site that I have informed to EMB in this link : https://quality.embarcadero.com/browse/RSP-33810 ( writing to the stream ).

Now I am going to test my solution, I will inform you about the result.

I have attached the :

event window log.zip

With the lines copied from the Event windows from the Delphi 10.4.2. IDE.

Thank you.

SchneiderInfosystems commented 3 years ago

Your event log shows no unexpected defect nor the Access Violation. Is this the log based on your modifications?

macc2010 commented 3 years ago

Sorry, no, I have not copied the correct event log. :-(

I will prepare another test.

Thank you.

macc2010 commented 3 years ago

I have tested again, and now I have the event log with the access violation :

event window log.zip

When the access violation is raised, I have clicked the break button, and after this I have showed the call stack window and the content is :

System.Net.HttpClient.THTTPClient.ExecuteHTTPInternal(???,$936C000,TWinHTTPResponse($A177B90) as IHTTPResponse) System.Net.HttpClient.THTTPClient.ExecuteHTTP(???,???,TWinHTTPResponse($A177B90) as IHTTPResponse) System.Net.HttpClient.THTTPClient.InternalExecuteAsync$ActRec.$0$Body System.Net.URLClient.TURLResponse.AsyncDispatch System.Classes.TBaseAsyncResult.DoAsyncDispatch System.Net.URLClient.TURLResponse.Schedule$ActRec.$0$Body System.Threading.TTask.Execute System.Threading.TTask.InternalExecute($A1DF5E0) System.Threading.TTask.InternalWork(???) System.Threading.TTask.ExecuteWork System.Threading.TThreadPool.TQueueWorkerThread.ExecuteWorkItem(TTask($A1DF62C) as TThreadPool.IThreadPoolWorkItem) System.Threading.TThreadPool.TQueueWorkerThread.Execute System.Classes.ThreadProc($A0E7300) System.ThreadWrapper($54BBD60)

In System.Net.HttpClient.THTTPClient.ExecuteHTTPInternal I have seen the problem here :

bug

This test has been done with the source code that is in github without nothing modification to it.

However, in previous test, I saw that the memo hangs with a lot of lines ( TMemo of FMX application DEMO.dpr ) filled here :

procedure TfmxFirebaseDemo.OnFSRequestError(const RequestID, ErrMsg: string); begin memScanFS.Lines.Add(TimeToStr(now) + ': Listener error ' + ErrMsg); end;

So I commented that line to run as this :

procedure TfmxFirebaseDemo.OnFSRequestError(const RequestID, ErrMsg: string); begin //memScanFS.Lines.Add(TimeToStr(now) + ': Listener error ' + ErrMsg); end;

To avoid any problem with the Firemonkey framework ( for example when a TMemo has a lot of lines and it hangs ).

It has been the only change that I have done.

Thank you.

SchneiderInfosystems commented 3 years ago

I looked at your latest change. From my point of view, this solution bears the risk that TMemoryStream and THTTPClient are no longer continuously rebuilt in case of an error and thus an error can propagate without any chance of recovery.

If you can show that your solution runs better overall, I'm happy to be convinced.

I am collecting with FSSimpleChat for over a month's long-term running experience under Windows64/32 and Mac64 as well as Android64. Do you test only under Win64?

macc2010 commented 3 years ago

I looked at your latest change. From my point of view, this solution bears the risk that TMemoryStream and THTTPClient are no longer continuously rebuilt in case of an error and thus an error can propagate without any chance of recovery.

In the case ot TMemoryStream, I think that is a class very tested and if you put its size to 0 and the position to 0, is the same that destroying it and creating again, I do not know what problem can produce it. In the case of THTTPClient, I have showed you more than one access violation, for example in the destroying the internal certificatelist.

If you can show that your solution runs better overall, I'm happy to be convinced.

Every night, I am testing DemoFB4D.dpr, and this night with my solution has been the only night that I have not seen any access violation, and when I have comeback to the office this morning, I have changed the collection in Firestore and the DemoFB4D.dpr has showed the changes in the TMemo, so I think that it works. The only problem has been that when I have closed the application, the Delphi IDE has shown a memory leak, but I think that is a problem in Delphi net libraries.

Days before, I tested without having to create TMemoryStream and creating and destroying THTTPClient and the program finally raised exceptions. After that, I decided to test with the current modifications ( without having to create and destroy TMemoryStream and THTTPClient ) and the result is that I have just commented previously

I am collecting with FSSimpleChat for over a month's long-term running experience under Windows64/32 and Mac64 as well as Android64. Do you test only under Win64?

I am testing only with Win32 in DemoFB4D.dpr.

By the way, I have tried to use the FSSimpleChat demo and it did'nt work for me, never show me the next form after clicking in "Register display name" and no exception is raised.

Have you tried to test the DemoFB4D.dpr only for a night?. I have the problem in a few hours ( sometimes in only one hour ).

I do not know if it can be the difference, but I am listening for a change in a subcollection, not in the main collection.

Thank you.

macc2010 commented 3 years ago

Hello,

I was suspecting that the problem only was going to occur with the FMX framework, so I developed a small vcl application to test the firestore listener. I run it, and after working during 5 hours, it has not raised any access violation nor memory leaks.

My intention is to use FB4D it in a VCL application, so I will stop my tests here, now, I know that it works well in a VCL application Win32 that is my target.

Thank you very much by your atention and I am sorry by the wasted time.

macc2010 commented 3 years ago

As a detail that I have seen, I propose you a small change in the FB4D.FireStore.Listener.pas.

The current code is :

procedure TFSListenerThread.OnEndListenerGet(const ASyncResult: IAsyncResult); begin try ........... FreeAndNil(fStream); if assigned(fGetFinishedEvent) then fGetFinishedEvent.SetEvent except on e: ENetException do begin ............................ FreeAndNil(fStream); fGetFinishedEvent.SetEvent; end; .............................. end; end;

I propose you to add the if assgned(fGetFinishedEvent) also in the except to be :

procedure TFSListenerThread.OnEndListenerGet(const ASyncResult: IAsyncResult); begin try ........... FreeAndNil(fStream); if assigned(fGetFinishedEvent) then fGetFinishedEvent.SetEvent except on e: ENetException do begin ............................ FreeAndNil(fStream); if assigned(fGetFinishedEvent) then fGetFinishedEvent.SetEvent; end; .............................. end; end;

Because if a problem could to be in the try, also in the except could to be.

Thank you.

macc2010 commented 3 years ago

Hello,

I was suspecting that the problem only was going to occur with the FMX framework, so I developed a small vcl application to test the firestore listener. I run it, and after working during 5 hours, it has not raised any access violation nor memory leaks.

My intention is to use FB4D it in a VCL application, so I will stop my tests here, now, I know that it works well in a VCL application Win32 that is my target.

Thank you very much by your atention and I am sorry by the wasted time.

Sorry, I have not commented that in this test, I am using the original library without modifications, and no access violation has been raised, I am not using my modified library.

Thank you.

macc2010 commented 3 years ago

Hello,

I am afraid that the problem persists, this morning I arrive and seen the same access violations after a few hours running. I want to implement an exception logger and execute this night without the Delphi IDE to see if the problem persists. I will do 2 tests, one with my modifications and the other with the original source from github. I will inform of my progression.

Thank you.

SchneiderInfosystems commented 3 years ago

Unfortunately, this was to be expected. The VCL can bring other time frameworks. An existing race condition in THTTPClient, which is identical under FMX and VCL, may show up more often or less often as a result. But it will not dissolve. I will also run further tests to narrow down this problem.

Did you prepare the Firebase according to the instructions? https://github.com/SchneiderInfosystems/FB4D/wiki/Firestore-Chat-Application

macc2010 commented 3 years ago

Did you prepare the Firebase according to the instructions? https://github.com/SchneiderInfosystems/FB4D/wiki/Firestore-Chat-Application

I think that I have less restriction, I have this in Firestore rules :

rules_version = '2'; service cloud.firestore { match /databases/{database}/documents { match /{document=**} { allow read, write: if true; } } }

And this in Storage rules :

rules_version = '2'; service firebase.storage { match /b/{bucket}/o { match /{allPaths=**} { allow read, write: if true; } } }

Thank you.

SchneiderInfosystems commented 3 years ago

@macc2010: Pls check my last improvements. Can you confirm that the AV is now gone?

macc2010 commented 3 years ago

Hello,

I will test the solution and inform you.

Thank you.

macc2010 commented 3 years ago

Hello,

I have seen the modifications and it seems that now, the asserts in TFSListenerThread have been removed in a very elegant way.

But I think that all the problem with the access violations is in the TFSListenerThread.StopListener.

I have seen that the thread does not wait to the execute procedure to finish, it is waiting for a timeout ( with 500 ms as maximum wait ), so in any race condition the StopListener finish without that the execute procedure have been exited and at continuation, the destroy of the TFSListenerThread is produced, being freed fGetFinishedEvent and the processmessages can inclusive cause that the Application object be freed and any access violation can occur at any site ( because of that I have seen access violation accessing fGetFinishedEvent ), included when you ask in the source code by Application.Terminated, I have had access violation in the TFirebaseHelpers.AppIsTerminated when Application.Terminated is going to be executed because, probably, Application object is nil, so this take me to think that the TFSListenerThread.StopListener have terminated prematurally and the application have terminated but the thread not.

This condition can occur, and while it is not solved, I think that we are going to have access violations.

I suspect that the timeout exists because Delphi 10.3 does not cancel the BeginGet asynchronously, but in Delphi 10.4 this was solved, so, why not to control this condition and remove the timeout in Delphi 10.4?, or remove it in all Delphi versions?. I think that the condition of the timeout in the TFSListenerThread.StopListener is not a good idea, the thread can not be freed until the execute procedure have finished, it is a basic in a TThread class, I think.

What's your opinion about it?.

Thank you.

macc2010 commented 3 years ago

@macc2010: Pls check my last improvements. Can you confirm that the AV is now gone?

The access violation persists, writing to the stream :

FB4D_FirestoreListener_Accessviolation

macc2010 commented 3 years ago

To test the problem, I have left my delphi application running all night ( runned from the Delphi IDE ) with the option "Notify on exceptions" to false. When I arrive in the morning, I can see an access violation and the debug Delphi IDE box with the buttons "Break", "Continue", and I have pressed "Break" and the Delphi editor has showed me the screen that I have sent to you.

The solution that I did propose to you does work. My final sources are :

Source files modified.zip

macc2010 commented 3 years ago

After the access violation, I have had to terminate my application using Run -> Program reset ( Ctrl + F2 ) because the program did not finalize the execution.

macc2010 commented 3 years ago

Hello,

With the current sources from github, I have access violations with Delphi 10.4.2 running in a VCL target windows 32 bits.

I have tested my modifications and after hours running ( 7 hours ), I have not had any access violation, nor running nor finalizating the application.

My final sources are :

Source files modified.zip

One important point is to ask by if Terminated in the TFSListenerThread.OnEndListenerGet ( see FB4D.FireStore.Listener.pas ), because, sometimes, the callback event OnEndListenerGet that is provided in the fClient.BeginGet, is called when the thread TFSListenerThread has exited from the execute procedure, I suspect that it is an issue from Delphi THTTPClient class, that when it is destroyed, the request or the response is not synchronous destroyed and the callback is finally called in an asynchronous way, even if the object of the caller have been destroyed ( in our case the TFSListenerThread ).

Thank you.

macc2010 commented 3 years ago

This night I have done another test for 7 hours ( with the sources provided previously ) and I have not had any access violation, nor running nor finalizating the application. After 7 hours, I have done a modification in a document and the listener has detected it. After that I have closed the application and none access violation, it has terminated ok.

SchneiderInfosystems commented 3 years ago

I have reviewed your last changes deliverd in "source files modified.zip".

  1. As already explained, I do not like to extend the life time of the THTTPClient and TMemoryStream objects to entire lifetime of the thread, because in case of an exception while accessing the stream, in your solution these classes will never be restart.
  2. I also realized that under De 10.4. issues with stopping the HTTP request have now been resolved. Still, a timeout never hurts in this situation from my point of view. In the past, my application got stuck in this loop and would not close. To prevent this I introduced this timeout. I want the application to close under all circumstances. Whether memory leaks occur when closing the application is no longer important in this situation.
  3. The only idea I've picked up from your solution is the fast terminatation within OnEndListenerGet and OnRecData. This measure can effectively speed up the closing of the application while receiving long messages while application shutdown.

So I'm closing this case now knowing that it doesn't completely fix all the underlying problems in HTTPClient. But I think I found a good workaround where the listener always reconnects to the server after disconnections and always detects new changes in the firestore. It is also important for me to emphasize that the solution must work on all platforms and on mobile devices but also on Windows Tablet PCs. It is also important that WLAN and internet interruptions must be detected and handled correctly.

Feel free to report new upcomming issues if you see, that the current solution does not cover the above requirements.

macc2010 commented 3 years ago

I understand your point of view with points 1, 2 and 3, are logical.

I will test it and inform to you the results.

Thank you very much.

macc2010 commented 3 years ago

I have left this night the program running ( with the sources of github ) and when I have comeback to the office this morning I have seen an access violation in the THttpClient trying to write to the stream, so the problem has not been still resolved.

I can imagine that I am not the only one person who can have this problem.

As I have tested my modifications, and I am still testing the modifications before production use, and as I am implicated in to send to you any error that I see in this modifications, I propose to you to have a conditional compiler definition to resolve this problem.

Can you see what is your opinion about this source code? :

Source files modified 20210519.zip

As you can see, all the applications that would use this source code would run exactly like they have done until now, but if you remove the dot to the {.$DEFINE FIRESTORELISTENER_WINDOWS_AV_FIX} conditional define, the fix that I have tested in Windows projects would be applied.

I suppose that it would be useful for someone who would have the same problems that I have had, and the solution to this, only would suppose to remove the dot of the new conditional define.

My solution is only tested in Windows projects ( fmx and vcl ) and in Delphi 10.4.2, so because of that I have added these conditions before the {.$DEFINE}.

macc2010 commented 3 years ago

I confirm you that my fix is working without access violations and receiving correct data from Firestore, it is stable.

I send you the last sources, that are the same than "Source files modified 20210519.zip" but correcting {$IFDEF WINDOWS} by {$IFDEF MSWINDOWS}.

Source files modified 20210521.zip

SchneiderInfosystems commented 3 years ago

Can you check if the AV occurs in the small period, where in my solution the Stream and/or the client is nil? For this purpose you need to log the creation and destroying in my solution. If you can show me, that your seen AV is in the period between destroy and re-creation I will made more deeper investigation. Thank you!