Closed luckman212 closed 10 months ago
Sadly I do not have an M1 Mac. Please try this workaroud: use the -noeta switch
This will "turn off" the advancement (i.e. the myavanzamentoby1sec function) which may be the culprit. If the problem does not recur, then the error is in there, and we will investigate further
Please let me know, thanks
Thanks @fcorbelli
I added the -noeta
switch to my script. It runs automated anyway in the background, so there is nobody there to see the ETA anyway. I will let you know if I see further crashes.
Might be too soon to tell, but so far, no crashes since adding -noeta
Good, this could help very much for the debugging
I've had your .cpp locally, so I took a look for a moment (I haven't run anything): could it be that in
char* mymigliaia(int64_t i_bytes,char* i_buffer,int i_buffersize)
This one
char *p=&i_buffer[sizeof(i_buffer)-1]
should rather use i_buffersize?
could it be that in
char* mymigliaia(int64_t i_bytes,char* i_buffer,int i_buffersize)
This one
char *p=&i_buffer[sizeof(i_buffer)-1]
should rather use i_buffersize?
Specifically, I believe that the cited line is equivalent to
char *p=&i_buffer[7]
and that's improbable to be intentional.
@janko-js Nice catch. I checked that function with chatGPT and it seems to agree...
It's definitely a bug, due to a change from a vector of fixed size, for which sizeof works, to a pointer, for which it returns the size of the pointer and not of the vector
It's definitely a bug, due to a change from a vector of fixed size, for which sizeof works, to a pointer, for which it returns the size of the pointer and not of the vector
And the potential memory corruption that later can probably sometimes cause a crash (depending what's in memory before) could happen at the moment when the function processes a big enough number. That corruption doesn't have to cause an immediate crash, but a delayed one.
If you look at those lines you find my correction, against sizeof(), but in another function 😄
/// snprintf(i_buffer,sizeof(i_buffer),"%.02lf %s",mybytes,myappend[i]);
snprintf(i_buffer,i_buffersize,"%.02f %s",mybytes,myappend[i]);
Please check the attached pre-release 58_11s.zip
I wrote something here about invalid pointer, but it's actually static (I missed that detail!) so I deleted the previous post. But then the potential problem is running in multiple threads, more threads "attacking" the same static buffers, also a problem.
There is a mutex on the output, so this doesn't happen It's a choice for performance
Something like that
lock(job.mutex);
(print something)
release(job.mutex);
If you mutex, that can still potentially cost more performance then having local
char buff1[...], buff2[ ]
... migl( buff1, x ), migl( buff2, y) ...
Mutex is mandatory for console output Therefore only a single thread can call the output function migliaia
Better, the functionS migliaia No C++ strings allowed, too slow
Yep weird, but fast
Yes I'm not talking about C++ but on passing up to N buffer pointers to local buffers to a single myg function instead of having "static" buffers and N myg functions
Different functions is the fastest overall implementation, kind of "function unrolling"
Because you have to call multiple times for a single output row with printf %s Up to 5 in fact You really do not want to allocate and deallocate on stack char buffers, for (worst case) millions of times And you do not want any kind of conversions of char to strings etc
As I said it is a performance issue
Because you have to call multiple times for a single output row with printf %s Up to 5 in fact You really do not want to allocate and deallocate on stack char buffers, for (worst case) millions of times And you do not want any kind of conversions of char to strings etc
I know it differently, and I'm quite sure also in the use case there, where you use these statics, one "allocation" on the stack of a few local char[30 ] buffers in a function which is calling myprintf and a few miglias is practically zero cost operation (most of the time introducing zero new T states spent, and neither dependent on N in buff1 or buff1, buff2, buffN there, and, of course, nor on M which would be the sum of the bytes needed) and at the same time can even be less stressful for the CPU caches (a few bytes on the stack practically always in all CPU mem caches) than access of the statics (in data segment somewhere not touched since who knows when). And I agree, I would also avoid using C++ "string" classes there.
It's all measurable, and in almost any real-life code (doing complex enough work) local "plain" variables (and char [ ] are such too) are faster than statics. It also helps avoiding functions with many parameters, but simply passing pointers to "plain" structures around instead, where such structures are totally ok to be on the stack too, also practically needing "zero" time for what you refer to as "allocation".
The situation is more complex than you think The problem concerns the use, in a single printf, of multiple integer to string conversions (with dots) Something like
printf("ciao %s %s %s %s %s\n",migliaia(10),migliaia(100),migliaia(1000),migliaia(10000),migliaia(100000));
You REALLY do not want to "split" in multiple printf(), like
printf("ciao ");
migliaia(10);
printf(" ");
migliaia(100);
printf(" ");
migliaia(1000);
printf(" ");
migliaia(10000);
printf(" ");
migliaia(100000);
printf("\n");
To make a single printf you need (in this example) 5 ready strings. They can be C++ strings, in which case you will use c_str(), or null-terminated char* (C strings). Obviously the second case is preferable (much faster)
The problem now is to have "something" capable of converting integers to text. For C strings (char*) this is not possible, because you will have to allocate memory "somewhere" (e.g. on the heap), and then you will have to deallocate it (malloc/free => internal fragmentation), or on the stack, every time Like (suppose a migliaia function that get, as 2nd parameter, the output char buffer)
char string1[30];
char string2[30];
char string3[30];
char string4[30];
char string5[30];
migliaia(10,string1);
migliaia(100,string2);
migliaia(1000,string3);
migliaia(10000,string4);
migliaia(100000,string5);
printf("ciao %s %s %s %s %s\n",string1,string2,string3,string4,string5);
You cannot even return an array of characters allocated on the stack, i.e. a local variable of a function, just in your deleted post
You can, of course, create functions and helper structures, creating "your" strings, similar to C++, but at that point you might as well use std::string (too slow indeed)
This is why the various solutions have already been examined, with profiling (i.e. concrete measurement of performance), arriving at today's solution.
More precisely, in reality, there were actually different functions (function unrolling) for maximum performance but it was so bad, stylistically, that I decided to refactor along the lines (introducing the bug in question, having corrected one but not the two functions of conversion)
And no, passing structures to functions slows down. Sometimes it's not a problem, sometimes it is The "amateur" approaches, let's say from a programmer with little experience who has read about patterns and antipatterns in books, are very far from the professional approaches of those who have been programming for thirty years, and perhaps have also programmed compilers and perhaps programmed in assembly for decades You read "do not pass a lot of parameters to a function, this is bad and can make issues" True If you are programming a software with a GUI or that does almost "nothing" all the time (aka: not CPU bound)
Sometimes I'm lazy, and I don't care, to use decidedly sub-optimal systems.
Sometimes, however, I enjoy writing as it was done when programs occupied a maximum of 1000 bytes (the video memory of a Commodore 64)
You will see many apparently strange things, for less experienced programmers, such as the use of dozens and dozens of Boolean variables for switches, instead of much more elegant and compact structures (like maps, vectors or whatever you want) The reason, again, is performance
Testing a boolean variable is orders of magnitude faster than querying a complex structure. Sometimes also try to reduce the number of variables tested, to hope for an optimization in one CPU register (this is rarely achieved) by the compiler
And, within cycles repeated millions or billions of times [like in a compressor] the difference becomes noticeable
Then there are many more hard core approaches, i.e. specific optimizations for the single CPU, which are natural for an old assembler developer like me (e.g. the CPUID instruction is normally faster than an if(booleanvariable) and also reduces cache flush)
But I'm avoiding them, because CPUs are now among the strangest, and you can't assume that what is "smart" for processor X is also smart for processor Y, perhaps completely different
Short version: if you don't understand the "why" of a solution, maybe it has already been thought of, rethought, measured and optimized over time. Or, even, it is simply wrong 😄
The principle used in the example you've given, let's call that principle (1):
char s1[30], s2[30], s3[30], char s4[30], s5[30];
myf(10,s1); myf(100,s2); myf(1000,s3); myf(10000,s4); myf(100000,s5);
printf("ciao %s %s %s %s %s\n",s1,s2,s3,s4,s5);
is indeed that what I've suggested, in most real-life uses, to be faster, and in the rest of the cases, at least not slower than your "static char []" way (let's call that (2)). As I've mentioned: what you call an "allocation" in a function which already has its own stack frame doesn't cost any new cycles, as long as you still use char [ ] (and not C++ string classes) and it can be more cache friendly than your preferred "static" solution.
The function containing that printf typically already has to have its own stack frame and the instructions executed for that are identical, with or without the existence of the space on the stack reserved for these s1..s5. Additionally, in the specific example of yours, the execution times of myf and printf are already orders of magnitude higher than that stack frame maintenance, making the difference, which you say exists, probably immeasurable.
If you have actually observed the difference between (1) against (2) in the actual use, I'm curious to learn about some specific location in zpaqfranz where that can be seen in measurements. I can imagine a special example of a function which in the (2) case wouldn't need its stack frame, if the function code created due to (1) results in the need for the stack frame to be introduced by a compiler. It would be interesting to me to see an example where that actually happens and really changes the execution time (as in: "I measure ... with 1 and ... with 2). That would be a new insight for me.
The principle used in the example you've given, let's call that principle (1):
char s1[30], s2[30], s3[30], char s4[30], s5[30]; myf(10,s1); myf(100,s2); myf(1000,s3); myf(10000,s4); myf(100000,s5); printf("ciao %s %s %s %s %s\n",s1,s2,s3,s4,s5);
is indeed that what I've suggested, in most real-life uses, to be faster, and in the rest of the cases, at least not slower than your "static char []" way (let's call that (2)).
The syntactic complexity becomes enormous. That is, you have to write code upon code upon code, to get the same result in terms of performance So it is a worse choice
If you are curious to look for the "archaic" versions of zpaqfranz you will see different solutions for this problem. Frankly, I have not kept track of it, it is inside the sources
The question to be answered is: is the currently implemented solution concise (syntactically) and fast (in execution)? If the answer is yes, then it is the best tradeoff (under the circumstances)
To summarize, I'm saying for this, what I call (1):
// (1)
// works always, uses stack
char s1[30], s2[30], s3[30], s4[30], s5[30];
myf(10,s1); myf(100,s2); myf(1000,s3); myf(10000,s4); myf(100000,s5);
printf("ciao %s %s %s %s %s\n",s1,s2,s3,s4,s5);
I don't expect to be possible to measure in practical use cases to be slower (and I explained the technical reasons for that) than this:
// (2)
// needs functions with different names
inline char* myf1(int64 x)
{
static char s[ 30 ]; // uses data segment
return myf( x, s );
}
inline char* myf2(int64 x)
{
static char s[ 30 ];
return myf( x, s );
}
inline char* myf3(int64 x)
{
static char s[ 30 ];
return myf( x, s );
}
inline char* myf4(int64 x)
{
static char s[ 30 ];
return myf( x, s );
}
inline char* myf5(int64 x)
{
static char s[ 30 ];
return myf( x, s );
}
lock( ); // needs locks <-- critical before use of myfN because they depend on "static" vars
printf("ciao %s %s %s %s %s\n",
myf1(10),myf2(100),myf3(1000),myf4(10000),myf5(100000));
unlock( );
and also that (1) has the advantages of not depending on the locks used in the correct places to be able to trust the results of the output as soon as the code is called from more threads. You call the approach (1) "code upon code upon code", and I don't see it that way.
I understood you that you like the trade-off you use, and I agree that you can do what you like in your own code.
I personally consider (2) a kind of code which is has bigger potential to result in the output which can contain something that doesn't reflect the information that was supposed to be printed (due to the mentioned locking issues) but I definitely know that after informing you about that I really can't expect from you to do anything about it. Thanks.
I've also forgotten to make it fully clear: if you ever happen to have the calls to the functions containing uses of statics like this, and it happens that they are executed in parallel from different threads (i.e. the correct locks() didn't protect from that) that remains a source of potential crashes, even after the sizeof() bug is corrected. That is the main reason I've pointed to that, I really have no interest to discuss the esthetic preferences of the sources here, just what is actually executed as a consequence of something written.
Context is key During important phases (i.e., multithread compression), you cannot run the risk of having any race condition between threads, full stop. The output must be serialized, and here is the mutex This solves any possible issues at the root
On support functions (e.g., multithreaded folder scanning), on the other hand, the issue is much more relaxed You can get inexact results (partially overwritten), not crashes But there is not the slightest importance in the correctness of the data shown, they are advances info, nobody really cares You can even find on the developer forum my thread on this very situation
Short version: in critical section you must serialize the output => you must use a mutex or whatever => no possible crashes, no possible overwrite => already done
Everywhere else... who care? 😄
PS as you have seen, or maybe not, there is NOT printf() in zpaqfranz, but myprintf() Because I can easilly put a mutex right here, in myprintf(), if I really want to be sure
myprint(whatever)
{
lock( ); // needs locks <-- critical before use of myfN because they depend on "static"
printf(whatever)
unlock( );
}
myprintf("ciao");
myprintf("cucu");
But, in fact, there is already something like that (with explicit pthread mutex), just to be sure no concurrent calls
pthread_mutex_lock(&g_mylock);
if ((!flagsilent) && (!flagnoconsole))
{
setupConsole();
printf("\033[%d;0H",(int)i_tnumber+1);
restoreConsole();
}
myprintf("Thread %02d %03d s %12s: speed (%11s/s)",i_tnumber,trascorso,i_runningalgo.c_str(),tohuman3((int64_t)o_speed));
pthread_mutex_unlock(&g_mylock);
The "tohuman" functions are just like migliaia
You can get inexact results (partially overwritten), not crashes
If they aren't inside of the lock and if they use statics, there can be crashes. It's actually improbable that the result will be "just inexactness" on every CPU architecture, almost none has an identical semantics as x86 . ARM's will definitely re-order memory writes relative to other writes when x86 won't.
Read that slowly: they re-order writes. What you've learned to expect about how the code executes in multiple threads on x86 just won't help you.
It's a long topic and a lot is written about it.
Everywhere else... who care?
I don't know? I thought, you, as we are discussing this in the thread about the crash in native ARM code? And you would not write these functions to be executed without the locks in multiple threads, without having some expectations. Reading your answers:
I hope I have summarized your messages correctly. If it's so, I don't think it's realistic to create such functions (EDIT: which use static" -- to be clear, that's what I talked about all the time) and expect from them to not crash on different CPU architectures. You should familiarize yourself with the problems caused by reordering on non-x86 architectures.
Once you do, I believe you can find that my suggestion is significantly cheaper than locks, even if you manage to place them on all correct places (which is typically not trivial). Until then, I expect there will be crashes.
If you say you simply don't care about the non x86-like architectures, I can understand that too. It's your work.
Read that slowly
Read that even more slowly. There-is-mutex I repeat There-is-mutex
Yes, really, there is a g_mylock (a global lock) to be used to be sure no races during console output for non-Jidac threads
It's a very basic topic, I have writted a lot about it, more than 20 years ago Well, maybe even 30, let's say 28 😃
You should familiarize yourself with the problems caused by reordering on non-x86 architectures.
Thanks for your help. I will give it the consideration it deserves
Consider that not only did I write an entire operating system Unix-like on a non-Intel architecture (AXP) but I even designed my very own microprocessor (well, actually quite simple) and my first university paper was on Motorola 68K Let's say that I think I have a fairly precise idea of "how it works" It's my job, afterall
Short version: no risks for crashes by multithread, in zpaqfranz, (obviously excluding the bugs it is full of) Trivial problem already taken into consideration
I wrote my previous message after seeing:
You can get inexact results (partially overwritten), not crashes
Reading that lead me to believe that you aren't familiar with write reordering issues, as I can't imagine that it's possible to claim that without the locks. And with the locks I haven't expected partial overwrites.
I'm also not familiar with Motorola 68K architecture which was ever able to exhibit the Arm-like conditions for reordering, so I don't know how that's relevant.
If you already lock in all needed places, then I admit that there will be no mysterious crashes that are the consequence of your preferential use of static variables in multithreaded code.
This conversation was surely an interesting read and went way over my head. But thank you @janko-js for your insights and especially thank you @fcorbelli for fixing the crash. 🙏
Hi,
I am occasionally getting these crash reports from zpaqfranz - "Bus Error". Looks like it's crashing in a simple
printf
function? Is there any way to debug this further?Here's the most recent crash report: