R3BRootGroup / R3BRoot

Framework for Simulations and Data Analysis of R3B Experiment
https://github.com/R3BRootGroup/R3BRoot/wiki
GNU General Public License v3.0
18 stars 104 forks source link

R3BUcesbSource2 and add runtime struct info checking #910

Closed YanzhaoW closed 10 months ago

YanzhaoW commented 1 year ago

List of features:

Motivation:

R3BUcesbSource crashes the program when running NeuLAND lmd2cal task due to the double closing:

https://github.com/R3BRootGroup/R3BRoot/blob/71a9f2e0f8431b4188561462f6785cf0f605d13b/r3bsource/base/R3BUcesbSource.cxx#L51-L60

As has been discussed in #712 and #852 with @inkdot7 and @klenze, it would be better if the ext headers of the latest R3BRoot version could be compatible and used for the old experiments, without risking the "not-found" situations.

With the need to add this new feature and correct the existing malicious coding practices in the ucesb source class, the change of the old ucesb source class could result in 90% of its code to be rewritten. It's almost a certain that some people would be against it considering the new experiments coming very soon next year. Therefore, it's better to create a new class named "R3BUcesbSource2".

Testing result:

It's been tested with NeuLAND lmd data, both from s509 and s444, using the same ext header. For the s444, the number of neuland dp must be set to 8. Setting a higher number causes the program to terminate because of the unfound items (which should happen). However, the bar number value from s444 is somehow not correct. it could be caused by larger array sizes (causing "array_fewer" flag) of the struct defined in its sever side (need further investigation).

The implementation of runtime struct item checking of NeuLAND can be seen here in PR #890.

The implementation from the ucesb source side can be found in R3BUcesbStructInfo class.

If you have any suggestions, please comment below.


Checklist:

inkdot7 commented 1 year ago

R3BUcesbSource crashes the program when running NeuLAND lmd2cal task due to the double closing:

That could and should be repaired?

YanzhaoW commented 1 year ago

Hi, thanks for the comment.

That could and should be repaired?

The problem is the other closing from FairRoot is also not deterministic neither. Sometimes it closes and sometimes it doesn't. Instead of simply removing the calling of the close method, the best way, in my opinion, is to just let Close do nothing and handle the resource closing directly using RAII. That's also the reason why the close method of the new class is empty.

inkdot7 commented 1 year ago

The problem is the other closing from FairRoot is also not deterministic neither. Sometimes it closes and sometimes it doesn't.

That sounds like a bug that should be reported.

YanzhaoW commented 1 year ago

That sounds like a bug that should be reported.

Yes, I have reported to FairRoot devs few days ago. Closings on both FairSource and FairSink have a lot of problems. They are also trying to fix them. See this pr and this issue.

klenze commented 1 year ago

My hot take on destructors: I generally believe that cleaning up after things which are not likely to be instantiated within the event loop should be a low priority.

I so often read code in which people decide to just decide to call delete on any member variables which are (raw) pointers with no regard for ownership considerations. For Ucesb source, the only thing we might check is that the unpacker process is indeed dead. But I would hope that in any case, either processing stopped because the unpacker quit (out of data or error) or that it would try to continue to send data through the broken pipe which will also kill it.

Also, I see that you are already compromising with regard to using clear ownership semantics. Returning a raw pointer to a unique_ptr does not seem 100% kosher. Perhaps a shared_ptr would be more appropriate?

(Again, I do not think this matters at all. If no R3BReaders were ever deleted, that would also be an ok outcome for me.)

YanzhaoW commented 1 year ago

Hi @klenze. Thanks for the comment

Also, I see that you are already compromising with regard to using clear ownership semantics. Returning a raw pointer to a unique_ptr does not seem 100% kosher. Perhaps a shared_ptr would be more appropriate?

I am not sure which case do you refer to exactly. Could you elaborate on the "compromising with regard to using clear ownership semantics"?

Regarding to returning a raw pointer from unique_ptr, it's actually a good practice to do that as long as there is no ownership transfer. I would say one of the biggest misunderstanding of C++ smart pointers is to use them everywhere. Old raw pointer is still perfect to use in non-owning situations.

In my experience, I very rarely use shared_ptr except when dealing with concurrency. In a synchronous programming, with a good design, somehow it can always determine the life-time of an object during the compile time and assign it to an owner, for which unique_ptr is enough. Shared_ptr is only needed when there is no way to determine the life-time of an object, such as the situation where an object is shared by multiple threads and you basically can't tell which thread will close first and which will close last (The one closes last should call the dtor of the object).

YanzhaoW commented 1 year ago

Just one notice: The compile time error from the sofia folder is caused by rootcint as rootcint doesn't use the included path in cmake targets. See this discussion.

YanzhaoW commented 1 year ago

Found ghost process from ucesb server unclosed. Under investigation.

inkdot7 commented 1 year ago

Found ghost process from ucesb server unclosed. Under investigation.

Yikes. I thought that was repaired again in ucesb. (There was a problem where SIGPIPE got ignored; thus, the process did not die when an output pipe got closed for whatever reason at the other end (also crashes or Ctrl-C).) That should be these commits:

https://git.chalmers.se/expsubphys/ucesb/-/commit/7cc495a7757e40b1713e09ea2791110153b482a8 https://git.chalmers.se/expsubphys/ucesb/-/commit/62c42c7ed97d3e0f9cdd8cb4e7c0da1a660c89d8

This is the 'last defence' against ghosts. It is, of course, nice if one tries to close it more deliberately as well.

YanzhaoW commented 1 year ago

I'm using the updated version. Would the server process die if it gets a SIGKILL signal?

inkdot7 commented 1 year ago

I'm using the updated version. Would the server process die if it gets a SIGKILL signal?

I think nothing should be able to survive a SIGKILL. Only case that comes to mind is some filesystem hang. But then at least it would reasonably not use CPU any longer. Is it using CPU? Or if it is a zombie process.

YanzhaoW commented 1 year ago

It's a zombie process. I found it because of some undeletable .nfs files with running processes that use no CPU power.

inkdot7 commented 1 year ago

It's a zombie process. I found it because of some undeletable .nfs files with running processes that use no CPU power.

If it is a zombie, then its parent should still be alive, but have not yet asked for the zombie's exit status. If parent not alive, zombies are made children of pid 0, which typically asks for the exit status, to get rid of the zombies.

YanzhaoW commented 1 year ago

I somehow can't reproduce this zombie process now. Maybe they were created before I added terminate for child process.

If parent not alive, zombies are made children of pid 0, which typically asks for the exit status, to get rid of the zombies.

Is that so for ucesb server? Normally an orphaned process will be assign to the init process (with PID 1) when its parent dies. How does ucesb handle this?

inkdot7 commented 1 year ago

If parent not alive, zombies are made children of pid 0, which typically asks for the exit status, to get rid of the zombies.

Is that so for ucesb server? Normally an orphaned process will be assign to the init process (with PID 1) when its parent dies. How does ucesb handle this?

My bad, I should have looked it up... As you say, PID 1, not 0. So described the sam as you: what the system does with zombies, ucesb cannot change that behaviour (only do its part by wait()ing for the process (get its exit status)).

Was the zombie a ucesb itself (i.e. the unpacker, which would be a child of r3broot), or was it the ext_... something (ext_struct_writer)? That latter would be a child of the unpacker process. The data flows from the unpacker to ext_... to r3broot in this case.

YanzhaoW commented 1 year ago

I'm not sure. But I saw "Unexpected end of file from external writer." printout with yellow background when I killed the ghost processes by their PIDs.

But now I can't reproduce the ghost processes anymore. Even when I abort the R3BRoot in the middle of the run (Both close and terminate are not called), I still can't find any .nfs files leftover.

They may have existed for a long time when I run with old versions of ucesb.

YanzhaoW commented 10 months ago

Completed and ready for further reviews. If there is no other issues, we could merge it in one or two days.

YanzhaoW commented 10 months ago

Hi, @jose-luis-rs , it's ready to be merged.

jose-luis-rs commented 10 months ago

Thanks @YanzhaoW, could you also do a PR on the fast branch? Just to have the same code on both branches.

YanzhaoW commented 10 months ago

Sure, it seems that Github can't do this automatically. By the way, what's the plan for the "fast branch"? Will you merge it back to dev branch after the experiment? If not, I don't need to worry about possible merge conflicts.

jose-luis-rs commented 10 months ago

The idea is to delete the "fast branch" after the experiments and for this reason one must do the same PR in both branches, if the PR is accepted in the "fast branch" but proposed for improvements in the dev branch, the corresponding author should take care of it as soon as possible. Anyhow, I think that the best is to do the same PR in both branches without changing the "title", just to avoid conflicts.