AdamWhiteHat / GNFS

A complete, proof-of-concept, C# implementation of the General Number Field Sieve algorithm for factoring very large semi-prime numbers. The focus was on readability and understandability of the code, not performance.
GNU General Public License v3.0
56 stars 13 forks source link

Biggest factor #5

Closed Manuel3115 closed 4 years ago

Manuel3115 commented 5 years ago

Hey !

First of all great job on this project. I can tell you really did your research and put a lot of time and effort into this project. I want applaud you for that.

I was curious and wanted to ask you if the project is completely finished. If so, what is the biggest number you managed to factor using this algorithm and how long did it take ? I'm really fascinated with prime numbers and factors but I have nowhere near the knowledge to construct such a complicated algorithm. Good job !

AdamWhiteHat commented 5 years ago

Good question. The largest number I have factored thus far is a number of about 21 digits long. This number is paltry in comparison to what I believe its current limit to be and in no way representative of what this implementation is currently capable of, I just have not incrementally tested larger and larger numbers just to see what its capable of. Theoretically, I am not aware of anything that puts a limit on the size of number this thing can EVENTUALLY factor. Because of various inefficiencies, the long time it might take to actually achieve factorizations of large numbers places a practical limit on what it can factor, but I wouldn't know where to draw that line and it should always be going up as I replace certain bits with a more complicated but more efficient algorithm.

My goal is for this application to be able to at least factor RSA-100, which is a 100 digit, 330 bit number. At the time of this writing, I have not yet run an RSA-100 factorization to completion.

There are two issues:

1) During sieving, distinct relations are "sieved" and the result of this sieving is that the relation is deemed to either be a "smooth number" or a "rough number", based on whether its norm's prime factorization consists of all prime numbers that a smaller than some bound. This implementation keeps the smooth relations, as these are what we want and ultimately bring us closer to a successful factorization. It also tracks the rough relations as well, because it is possible to squeeze a few more 'smooth' relations out of the rough relations. However, it currently does not have this, and while it does occasionally flush this list to disk and then clears it, at some point that list gets too big between writes to disk that it throws an OutOfMemoryException building the string that it is going to write out to disk. What this does, then, is to effectively crash the application during any long-running operations. There are several ways to solve this, but at this point, there is just no advantage to holding on to these and it uses up tens of gigabytes of disk space during larger factorizations and I say it should just be commented out for now. If you look at Line #161 in PolyRelationsSieveProgress.cs, the contents of that whole else branch can be just be commented out. The reason I have not done this and checked it in yet is that I am in the middle of rewriting the sieving portion to work against an interface so that several different strategies for picking the (A,B) pairs to be trial sieved can be written and swapped out at runtime. Your interest in my project by asking this question has gotten me motivated to work on this project some once again, and I may just pull down another clone of this project and comment this out for you and everyone's benefit because the fact that this has not been done and is still an issue is a bit ridiculous.

2) This application does not have the functionality to search for 'good' polynomials at this time. In this algorithm, an 'algebraic polynomial' needs to be selected. The choice of polynomial affects how quickly smooth relations are found in later steps as well as the total number of relations you will be able to find. This makes the selection of your polynomial of the utmost importance and is the largest single factor affecting total factoring time. Most implementations will search for a polynomial for you, using a metric called Murphy's alpha function and possibly doing a little trial sieving to get an estimate of the density of smooth relation that that polynomial produces. This implementation is still barely beyond the proof of concept stage, and as such requires that the choice of polynomial to use has be supplied to it, either from a white paper that is discussing a successful factorization of that particular number and tells you the polynomial they used or from some other GNFS factoring program that does have a polynomial search routine, such as YAFU or mseive.

Given you comment out the offending code in #1 and you know a good starting polynomial to supply to the application, in theory, it should be able to factor RSA-100 or a semi-prime number of similar size, although this has not been actually done, and so other issues may be found that must be resolved before this is actually achievable.

I hope that answers your question.

Manuel3115 commented 5 years ago

Thank you for your answer ! I find it surprising that, even though it has been used over a decade ago to factor numbers such as RSA160, almost nobody in the world has been able to make a working GNFS algorithm publically available. I have to applaud you for being one of the first to make a semi-functional one. I guess it is explained by the sheer difficulty in understanding and writing such an algorithm. It is also very surprising that factoring numbers, something we learn to do in middle school, starts to have so much dept as you take bigger and bigger numbers.

Anyways, if you do decide to keep working on this project, I will be sure to follow your progress. I do have one last question for you. What is your background ? In what field did you study to be able to write a GNFS algorithm ? Did you learn everything from the books you read or did you have some kind of knowledge prior to that ?

Thank you for your answer.

AdamWhiteHat commented 4 years ago

Well, there is msieve https://sourceforge.net/projects/msieve which is written in C. It's written with performance in mind, so that means the readability suffers a bit. The only other GNFS implementation that I am aware of is GGNFS https://sourceforge.net/projects/ggnfs. Again, same deal regarding performance and readability.

It does get deep, but at the same time, its still all centered around the core idea of Dixon’s method, that is finding a congruence of squares, which was published in 1981. And, in turn, finding a congruence of squares is an improvement of the difference of two squares, which dates back to Fermat, ~1650 A.D.

In essence, we are still trying to find a congruence of two squares. The strategy is to build up squares using smooth numbers that satisfy several conditions related to polynomials which express the number to be factored. However, as the number to be factored grows large, so too does the polynomial and all the numbers involved. As number get bigger, the chances of them being smooth becomes vanishingly small. In order to keep the party going, some clever people (Pollard, Pomerance, Lenstra?) figured out that we can use abstract algebra to work in a number field (essentially where all numbers are reduced modulus some number, which effectively creates an upper bound for which all numbers we work with will be smaller than) to construct a what is equivalent to a square (it is the remainder of some square, modulo our modulus) and that satisfies all of the other constraints that we need it to, Once we have that , we can (hopefully) recover the actual square, who's value is unknown to us up until this point, from its remainder. Pretty wild. The devil is in the details, of course, but that's the gist.

Regarding your last last paragraph of questions, I do not have any formal degrees from a college or anything, if that's what you are asking. I'm self-taught, I don't learn well in a classroom environment, and I learn best when I'm teaching myself and learning it because I want to learn it or have an immediate use/need for the knowledge. I only went to college for 2 years before I dropped out and decided to write software as my career. My declared major was chemistry at the time, and I really didn't appreciate math the way I do now. After about 5 years doing development I entered a phase of my life I call my 'math renaissance' where I started to really appreciate math and enjoy it.

I pretty much learned the GNFS from the various whitepapers/journal papers that are available on the subject, of which there are not too many. Those that which I found the most helpful, I have checked in to the GNFS repository. Some of the papers available are very heavy on the notation, and taking lots of hand written notes on what the notation meant was critical. Basically, most of the algorithm was not that hard to implement or follow along, and it was only the very last step, the finding of a square roots of a polynomial over a finite field that kicked my ass. Everything up to this point is essentially the same as the quadratic sieve, which they say you should do first before attempting the GNFS, and boy that's great advice, I should have taken it. It is this last part that invokes all the number theory and abstract algebra. After struggling with it for a month or two, I realized I just wasn't going to bumble my way through and that I really needed to understand what was going on and that would require learning abstract algebra. So I stopped and took a 10 month break from the project to teach myself abstract algebra. It was something I wanted to learn anyways, so I was pretty motivated in this regard.

One of the resources that was the most helpful for me to learn abstract algebra was this series of YouTube videos by a maths professor named Matthew Salomone. Here is a link to his video series "Exploring Abstract Algebra II": https://www.youtube.com/playlist?list=PLL0ATV5XYF8DTGAPKRPtYa4E8rOLcw88y

Taking lots of notes and building a few of these groups/algebra systems in code really helped make things concrete in my head. Really, you just need to see lots and lots of examples of these things for you to visualize what is going on and for it to stick in your head. Cayley tables are like multiplication tables, but for groups and some rings. Making Cayley tables helped me visualize what was going on and really start to wrap my mind around these concepts.

I also lurked on https://www.mersenneforum.org a lot too. While I didn't get a lot from reading the msieve source code directly, the guy who wrote it, Jason P., wrote several posts about what he had done and what steps he had taken/written with his implementation as he worked through various road blocks, and that was very helpful.

Manuel3115 commented 4 years ago

Thank you for your reply. I really appreciate it !