Collaboration for a Python library

Bulat-Ziganshin / FastECC

Reed-Solomon coder computing one million parity blocks at 1 GB/s. O(N*log(N)) algo employing FFT.

Apache License 2.0

408 stars 39 forks source link

Collaboration for a Python library #5

Open lrq3000 opened 7 years ago

lrq3000 commented 7 years ago

Hey there,

I came from this post you posted on StackOverflow (thank you very much!). Your project is very interesting and i would be very interested in including it in my pyFileFixity project (a sort of MultiPar but aimed at maximum robustness and MIT licensed), and potentially port it into Python (you will be co-author ofc), as your readme explanations and code comments are very clear.

Of course, the Python port would be slower, but I think it can still retain most of the speed by using linear algebra specialized libraries.

For the port, I will probably need to ask you some maths questions when I'll be implementing. Would you be interested in this collaboration? I would start after you finished implementing decoding and the public API.

In any case, I'm really looking forward the development of this project, it is VERY interesting!

lrq3000 commented 7 years ago

Also let me explain a bit more about the specificities of the pyFileFixity project:

the main goal is to protect files by generating "ecc files" (also called "parity files" in other softwares). It also provides other tools to help digital data curation such as a majority vote decoder to use multiple backups for recovery, a corruption benchmark to test robustness, etc.
to ensure maximum robustness, multiple safeguards are provided to ensure structural protection of ecc files against corruption: ecc files follow a strict scheme with markers, and the markers positions are themselves protected by a "markers ecc file", which is a file that is protecting the structure of ecc files, allowing usually full recovery of structure. In other words, pyFileFixity does not only safeguard user's files against corruption, but also ecc files themselves.
to maximize the practical recovery of files and the ratio speed-and-space/recovery, a "variable length" scheme is implemented: files headers can have more ecc symbols than later parts of the file, until the end of the file has the lowest number of ecc symbols. This scheme is a crude generic approach to maximize entropy: usually, the most important info is contained at the beginning of the file, and less important info are in the rest. See for example compressed files: the most frequent decoding symbols are saved in the beginning of the file, losing these is critical and produces systemic corruption, whereas losing symbols near the end will only lose some local part of the file's content, but not the overall content.

My plan is to not only port your algorithm but also implement this "variable scheme" by implementing the change of the number of symbols on-the-fly (as I did with Reed-Solomon codes). Do you think this is possible?

Bulat-Ziganshin commented 7 years ago

i'm close to publishing of 0.1. decoder is already developed and checked, it just need new codebase and will be pushed immediately after 0.1 release. sorry, i spend too much time improving 0.1 docs :)

0.2 can decode data, but final API will require collaborative work. I don't yet have finished vision what the API should be, so we will start with working encoding/decoding procedures with some uncomfortable API and will try to make it more convenient.

as to Python implementation - yes, i will help you to understand math behind it. Once you know Reed-Solomon codes, it's pretty straightforward study. Did you read ALL md files in the project? In particular, Overview.md tells which parts of Math you need to learn first.

Also take into account #4 as well as lhc-rs library - these libraries may be better choice for you, so for the start you need to learn pros and cons of all these libraries. Just now Wirehair is the best choice since it's already working and well-tested. I suggest you to subscribe to #1 as well.

About porting it into Python - if it does meaning for you, it's possible. The code isn't that large - ~500 LOC. It has the "perfromance core" which is multiplication in GF(p) of array by scalar as wel as few more similar operations. Unfortunately, scalar algebra libraries probably doesn't support operations over GF(p), so you will need to find such libs or make a new one (or bind existing libs that perfrom that. I know about NTL, FLINT, there is even GPU implementation of entire fast NTT procedure!!)

Bulat-Ziganshin commented 7 years ago

i've read your docs but not found - what you mean by "on-the-fly" change of protection parameters? fastecc allows to use different protection configs in different calls - f.e. you can compute 128 parity blocks from 1024 256-byte source blocks on the one call and 1024 parity blocks from 256 600-byte parity blocks on the next one. no data are need to be precomputed

lrq3000 commented 7 years ago

Sorry for the late reply, I was reading some documentation (including yours!) to update me about fft and ntt encoding.

Yes you nailed it, by "on-the-fly" I meant changing the length of the parity blocks without having to precompute again.

I am really excited by your project, I'm eagerly awaiting for the decoder to be implemented so that I can link your library in my project :-)

I am also trying to port your ntt encoding into Python, but it's not as straightforward as it seems. I have an NTT and INTT functions, and I can compute the derivative of a polynomial. I also have functions to do pretty much any mathematical operation in GF. However, I cannot wrap my head around how to properly implement encoding, can you help me with an outline of the algorithm? I will of course add you as a co-author of both the reed-solomon library and pyFileFixity project (if that's anything worth!).

lrq3000 commented 7 years ago

About other projects, yes I checked them out, but Wirehair is LDPC and I would prefer a MDS such as Reed-Solomon because I understand it a lot better and it's deterministic, and the others are O(n²) which is just like my library (of course they are faster since they are written in C or C++ but asymptotically this does not make a huge difference). There is only ShianJhengLin's proof-of-concept library that might be interesting to port but it supports only erasures for decoding, not errors, so the usefulness is limited (I'm pretty sure it's easily adaptable to errors but I don't have the math skills nor the time to study it in order to implement this).

So your library is the most promising to me, it will finally make it practical to protect huge volumes of data (until now I was limited to hundreds of megabytes, with your library it should work for hundreds of gigabytes!). BTW, will your decoder implement error decoding or only erasure for the moment?

lrq3000 commented 7 years ago

Very nice paper BTW, very nice and concise, congrats :+1: The maths are a little bit above my head but I'm going to study that. And I read you mention the error locator polynomial is very similar to the formal derivative :-) So I guess once I understand how to plug in the NTT to replace the polynomal multiplication/division, it will be straightforward to implement it in all encoding/decoding routines!

Bulat-Ziganshin commented 7 years ago

So, the current state: As you can see in README, i've reinvented previous Lin algo that works in GF(p). Their latest paper implements the same high-level algo but in GF(2^n) fields. Their own library is research-grade and somewhat slower than mine but in June Christopher Taylor developed highly-optimized version of this algo and published it as https://github.com/catid/leopard - note that it has full working decoder and faster than my library.

So now you have one more choice :smile_cat: and outcome greatly depends on whether you want to use existing library or learn Math behind it and implement it yourself. I can help you with math behind my lib, Christopher probably can help you with his earlier libraries and you can reimplement Leopard in Python by mere translation of C++ code to Python.

If you want to learn how my lib works, you need to know:

GF(p) basic arithmetics
polynomial view of Reed-Solomon encoding (as opposed to the matrix view)
using FFT/iFFT to evaluate/interpolate polynomials
various FFT algos

GF(p) arithmetics is simple and covered in the Wikipedia. FFT/iFFT is there too, but i suggest to learn that from some book, in particular FxtBook is great and free.

Essential FFT algos also covered in FxtBook, you can find more in the Wikipedia again, but they are better organized in http://www.engineeringproductivitytools.com/stuff/T0001/index.html (although it doesn't cover important MFA algo).

Finally, various views on RS are covered here and there (f.e. in Wikipedia) but i can help you here. I plan to write my own tutorial, similar to famous Plank one, but describing my lib implementation, which will cover all these topics except for FFT algos.

Bulat-Ziganshin commented 7 years ago

Very nice paper BTW, very nice and concise, congrats

do you mean my README? i never filled any papers :smiley:

Yes you nailed it, by "on-the-fly" I meant changing the length of the parity blocks without having to precompute again.

I still don't understand what you mean by precomputing. Encoder and decoder functions are single-shot - they get all input blocks and return all output blocks for an entire shard in a single call. Of course, different shards may have entirely different configurations.

So I guess once I understand how to plug in the NTT to replace the polynomal multiplication/division

https://www.google.com/search?q=fast+polynomial+multiplication https://www.google.com/search?q=fast+polynomial+division https://www.google.com/search?q=fast+polynomial+evaluation https://www.google.com/search?q=fast+polynomial+interpolation

It's how I've googled several CS courses on these topics and can point you if you need. Except for multiplication, they are somewhat complex algos which are NOT used by my library. But if you want to implement error decoding, you may need to learn these topics.

Download archive - it's very large but you can find there everything i digged on the topic :smile:

BTW, will your decoder implement error decoding or only erasure for the moment?

I have never learned how error recovery works nor plan to do it :smile:. The only fast error decoder i remember is libnoe mentioned in msg - essentially this msg lists all fast ECC libs we know. You may also find extremely useful other topics of the PAR math subforum as we discussed architecture of possible PAR3 implementation.

I also developed program similar to PAR (although haven't published it), so my opinion:

There is a little utility in error decoding. Modern media tends to lose data in large contiguous blocks so as far as you keep a single crc for 4 KB sector or so (it's only 0.1%!) you can recover twice as much bits as erasures instead of errors. Yes, error recovery may help a little when only part of the sector lost, but this is not much help when you have thousands sectors in a shard. Moreover, error recovery never fails so you may end up with improperly recovered data. Dealing with all these problems may be possible but IMHO is too much work for too little benefits, i would prefer to spend time on more practical improvements
Since FastECC operates in GF(p), it require recoding binary data prior to encoding and after decoding. This 1) makes absolutely useless any error recovery support, 2) adds one more word to each sector descriptor. The last point makes my algo somewhat non-MDS, but each sector descriptor in my program already carries about 40 bytes, so this extra overhead is not a deal-breaker. Note that Leopard/Wirehair doesn't have this issue
LDPC algo such as Wirehair doesn't make much difference to MDS algos except for a cases when you have too few blocks. With >=1000 blocks and 0.1% probability that you will need even one extra block (and even less probability that more blocks are required), situations when this single block matter, should be extremely rare. Also, Wirehair algo AFAIK is O(N) and computationally should be faster than RS algos including FastECC and Leopard, that may be important for your slower Python implementation. I also think that other differences are more important such as algorithm (in)ability to deal with more than 65536 blocks, handling of non-round configs such as 1025 blocks, and memory requirements

lrq3000 commented 7 years ago

Ah sorry I thought you were a co-author of "An Efficient (n,k) Information Dispersal Algorithm based on Fermat Number Transforms" by Sian-Jheng Lin and Wei-Ho Chung, but anyway yes your documentation is very good, thank you very much for taking the time to write that :-)

About error recovery, it is useful in case your ECC symbols can be tampered too (and not just the data). In this case, CRC can also be tampered too and you lose all ability to erasure decode, but not error decode. I will find a way to enable error decoding if at least I can find a way to encode using NTT :-)

About LDPC algo limit of 65536, this can be alleviated by using chunking so that's not a big deal, same for non-round configs which can be fixed by padding, it's similar limits to usual Reed-Solomon. But that's a totally different algorithm, LDPCs are VERY tricky to get done right, so porting it to Python is bound to be very complicated :-(

Thank you very much for the links and documents :+1: I will study that. But however about NTT for polynomial division (RS encoding), I could not find much, this is something very very new, most other papers and tutos talk about FFT for fast polynomial division (which I am trying to implement, but I am getting mixed results for now). Do you have any pointer for NTT in RS encoding beside the paper by Sian-Jheng et al?

Bulat-Ziganshin commented 7 years ago

Ah sorry I thought you were a co-author of "An Efficient (n,k) Information Dispersal Algorithm based on Fermat Number Transforms"

Do you understand the paper? If so, you are smarter than me because I got it only when rediscovered their algo myself.

I will find a way to enable error decoding if at least I can find a way to encode using NTT :-)

Do you understand O(N^2) error decoding algos? I'm not and they are more complex than erasure algos. And except for noe, i don't remember any fast error decoding algos published.

About error recovery, it is useful in case your ECC symbols can be tampered too

I think it can be managed by improved ECC file structure:

protect control data with ECC too
spread these protected data over entire ECC file, interleaving them with regular ECC data in chunks that are smaller than usual span of lost data (4 KB should be enough for modern media)
use 5x more protection for control data (f.e. if regular data are protected with 10% redundancy, use 50% redundancy for control data)

This way, when you lose chunk of control data, you usually also lose chunk of regular data, and since you have 5x more protection for control data, it will be extremely rare when you have enough regular data for error-decode but not enough ECC data to recover. It is how my own PAR-like utility works.

About LDPC algo limit of 65536, this can be alleviated by using chunking

I prefer to use 4 KB data blocks to match modern media. Use of multiple shards (f.e. 16 shards of 64K blocks instead of single shard of 1M blocks) makes overall code essentially non-MDS and decreases recovery chances. I prefer to use as much blocks in single shard as can be loaded to RAM simultaneously, and with 4 KB block and modern computers this is ~~1M blocks.

same for non-round configs which can be fixed by padding, it's similar limits to usual Reed-Solomon.

O(N^2) RS doesn't need padding and can easily support GF(2^32) operations, but using more than 64K blocks will be extremely slow anyway. With padding to 2^n blocks, when going from 32768 to 32769 blocks you immediately double memory usage and halve speed. So the solution, that can increase number of blocks in smaller steps, is extremely important when we want to encode data in as large shards as can fit in RAM.

Bulat-Ziganshin commented 7 years ago

Do you have any pointer for NTT in RS encoding beside the paper by Sian-Jheng et al?

In BulatECC.zip, these are in directories Par3 (collecting code/papers that were considered for PAR3 implementation by MultiPar author), NTT_RS and mateer-thesis.pdf. Note however that the approach i used IMHO is simplest and fastest among all algos proposed in these papers. So if you understand this paper, you don't need to look any more, except for curiosity.

But however about NTT for polynomial division (RS encoding), I could not find much, this is something very very new, most other papers and tutos talk about FFT for fast polynomial division

As i described here (and Lin et al in the paper), we don't need polynomial division at all, and even less for encoding.

NTT is the shortcut for DFT performed in a Galois field. FFT is any fast algo computing DFT. So any FFT algo is usable for computing NTT. Papers/courses talk about FFT just because it's a generic notion - these algos are applicable both for polynomials over C and polynomials over GF(p^n).

Encoding algo is described here. Do you need more info to understand it?

lrq3000 commented 7 years ago

Hey Bulat,

Thank you very much, it's an amazing database of papers you've got, thanks for sharing! Unluckily no I don't understand the whole paper yet, I'm no mathematician either, I'm just a bit familliar with good old Reed-Solomon procedures :-)

About ECC protection, it's not really necessary to protect ECC bytes because they are protecting themselves, I'm not sure that adding another ECC on ECC is actually better Signal-to-Ratio wise compared to just increasing the number of ECC bytes. But the file structure of the ECC file/PAR file can actually be protected, and this is what I did with pyFileFixity with index files :-) They protect the metadata.

About error detection, I posted on SO the Python code and the related books. The best one being: "Algebraic codes for data transmission", Blahut, Richard E., 2003, Cambridge university press.

It's really the best book about Reed-Solomon and errata decoding, it describes almost everything, and I'm pretty sure it would work with your implementation since it's simply giving the errors locator polynomial, you can then use your decoding procedure with it just like if it was the erasure locator polynomial. But that's just for your information, I understand you don't have the time to implement it :-)

About the NTT encoding, only the "Faster" section describe it, and your code. I tried to infer from the code what the algo does, can you help me filling up the missing pieces?

def encode(message, N=255): # N is the galois field size right?
    coefs = INVERSE_NTT(message, N)

    generator_polynomial = generate_poly(...)
    for g in generator_polynomial:
        ???
    return NTT(encoded_message, N)

Usually, instead of ??? I just do a gf_polynomal_division(message, generator_polynomial), but here you're telling me that there is no polynomial division, so I have a hard time understanding what is done in this part (because also you are using a per block encoding, which I guess is the 1MB cache optimization you are talking about above).

lrq3000 commented 7 years ago

Also if you want to support different parameters for your RS encoder (such as US FAA ADSB UAT RS FEC standard), I think you can change in RS.cpp:

T root_2N = GF_Root<T,P>(2*N), inv_N = GF_Inv<T,P>(N); to: T root_2N = GF_Root<T,P>(generator*N), inv_N = GF_Inv<T,P>(N);

and: T root_i = GF_Mul<T,P> (inv_N, GF_Pow<T,P>(root_2N,i)); to: T root_i = GF_Mul<T,P> (inv_N, GF_Pow<T,P>(root_2N,i+fcr));

With these two changes, you add two parameters: generator (by default 2, usually alpha in papers) and fcr (by default 0) that allow compatibility with any other RS codec. The resulting codeword will be different, but at decoding you should have the exact same decoding power (same number of erasures/errors recoverable). Then the US standard has these parameters:

gf = 256
generator = 2
primitive polynomial = 0x187
fcr = 120

Of course, at decoding, you also have to apply these changes: use the generator number to generate the generator polynomial, and fcr everytime you have to loop over the generator polynomial (more specifically: at syndrome calculation, and forney algorithm to correct errata).

Bulat-Ziganshin commented 7 years ago

About ECC protection, it's not really necessary to protect ECC bytes because they are protecting themselves, I'm not sure that adding another ECC on ECC is actually better Signal-to-Ratio wise compared to just increasing the number of ECC bytes. But the file structure of the ECC file/PAR file can actually be protected, and this is what I did with pyFileFixity with index files :-) They protect the metadata.

And I do the same. Now let's see:

If you save 8-byte CRC64 for each 4 KB sector (data and ECC), and you are 100% sure that CRC will survive, you don't need error decoding. Moreover, this allows to restore N lost data sectors from N survived ECC sectors, while error decoding will need 2N survived ECC sectors for recovery of N lost data sectors.
Now let's compare. F.e. you use 20% redundancy for regular data, and 5x more for metadata, i.e. 100%. With CRC64 protecting each sector, you need to store 5 data sector CRCs and one ECC sector CRC per each ECC sector, i.e. 6*8=48 bytes. Adding 100% redundancy, it's 96 bytes metadata stored per ECC sector - still much less than any reasonable sector size.
So, we have a choice - in order to restore a lose of 20% data sectors, we can either add 40% redundancy and rely on error-recovery algo, or add 20% redundancy, plus 96 bytes to each ECC sector stored, and use simple erasure-recovery algo. Second way is definitely more efficient if ECC sector is more than 96 bytes long.
Now we want to ensure that we can restore all metadata when at least ~80% of ECC sectors are survived. To ensure that, we use data/ECC sectors as small as usual minimal span of lost data, i.e. physical sector of actual media. It was 512 bytes in old HDD/FDD, but in modern HDD/SSD it's probably 4 KB at least. So, we encode data using 4 KB sectors and interleave metadata with ECC sectors in the ECC file. This way, each time we lose metadata record, we usually lose nearby ECC sector. So, as far as we lose less than 20% of ECC sectors, we usually lose less than 20% of metadata records, which is inside our 50% target by a wide margin.

Does it make sense now?

lrq3000 commented 7 years ago

Ok I see, but there are two differences in my implementation and the goals:

the goal is to ensure long term file fixity (over decades!), so virtually everything can (will) be corrupted, including ECC and CRC bytes.
I am encoding per symbol, not per block. If you do per symbol encoding, then using a CRC (or a hash, I think that's equivalent as you are just trying to error detect on a block) is useless, because the CRC cannot give you the location of the errors, it just tells you there is an error somewhere in the block. But if you do block encoding like PAR, then yes it is useful because one block = one symbol that can be corrected by ECC, so in this case the CRC is giving you the location, for far less storage requirements than ECC.

I will now look into block encoding as symbol encoding is clearly a dead end in terms of speed performance (maximum ~10MB, whatever the programming language). I will then come back to implementing NTT :-)

lrq3000 commented 7 years ago

Coming back after a lot of reading.

The most useful resource I could find was by the famous J. Plank: http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf

So there is no magic block coding, the matrix multiplication approach anyway requires XOR additions, which is not available in linear algebra libraries of Python, so anyway reimplementing them manually will be slow. What I am missing the most to get a fast enough codec I think is SIMD, I will look into that but I don't hope too much because it seems difficult to access this kind of accelerated instructions from Python...

Now about the scheme:

PAR1 and PAR2 use a non-systematic horizontal coding scheme: it means that it considers the input message to be the i-th byte from each file. Hence, you get k source files, and generate n-k ecc files. The i-th byte of the ecc files correspond to the i-th bytes of each k source files, and all these bytes make the input codeword. Thus, the size of the ecc files is proportional to the source files, but there is a hard limit to the number of files, being limited by the chosen Galois Field.
pyFileFixity uses a non-systematic vertical coding scheme: k designate source symbols, and not source files: each file is divided in blocks of k source bytes, and (n-k) ecc bytes are generated. In the end, we can also generate (n-k) ecc files or just one file, just like PAR.

So, having said all that, I don't think using horizontal scheme makes much of a difference performance-wise, or am I mistaken?

If it does not make such a big difference, I will just use your lib when you finish the decoding, as it is near impossible to implement SIMD optimizations in Python :-(