Rassibassi / claude

End-to-end learning of optical communication systems
MIT License

Why add pilot symbols in the WDM system over SSFM example? #1

Open Rassibassi opened 8 months ago

Rassibassi commented 8 months ago

some questions were asked here: https://github.com/Rassibassi/claude/commit/a5c74a66a8f7d2a43f95430308b582a5e6709d3b#commitcomment-136146827

I'm creating a new issue for the latest question so that it becomes more visible.

The last question by FrankTian0012 was:

Hi Rasmus~ Thanks for helping me understand this code. I am reading your example of constellation shaping across the SSFM via end-to-end learning

I have some other questions. Why did you set pilot symbols in this simulation? What are the pilot symbols for? Also, why the parameters nPilots=128 and pilotOrder=4? Thank you very much again for continuing to help me. I really appreciate it!

Rassibassi commented 8 months ago

Great question. This was at the very end of my experimenting with this model before finishing my PhD, so things are not 100% complete, but let me start by describing the 4 different versions of the iPython notebook you are linking to.

  1. examples/tf_wdmSystem.ipynb Link: Simulation of a WDM system with the SSFM method using the TensorFlow implementation. The SSFM step size is large and nonlinear effects are set to zero. This ipynb was used to make sure the implementation was logically correct (i.e., no programming bugs left).

  2. examples/tf_wdmSystem.html Link as rendered HTML: Same as 1., however including nonlinear effects. The simulation was run on a cluster using a GPU, output was then rendered as html. It was run as follows:

    jupyter nbconvert --to html --execute tf_wdmSystem.ipynb
  3. examples/tf_wdmSystem-learning.ipynb Link: Doing backprop and thus training a constellation, or rather pseudo-training, as the nonlinear effects are not modeled.

  4. examples/tf_wdmSystem-learning.html Link as rendered HTML: Same as 3., training via backprop but including nonlinear effects; the training was run on a cluster using a GPU, and the output was rendered as HTML with the above command, but running tf_wdmSystem-learning.ipynb instead.

Let me first say that this is a demo repository; I had another one for my actual research, which was messier, but where all my Python code lived in .py files instead of notebooks, making it easier to run on a cluster. I distilled this demo repository in order to have a cleaned-up version for the public. In hindsight, the notebook format does not really make sense for the WDM system examples, as they are compute- and memory-intensive, which forced me to use jupyter nbconvert.

I hope the differences between the 4 examples above are clear. In examples 2 and 4, the nonlinearity of the fiber channel introduces a static phase rotation; no such rotation is introduced in examples 1 and 3. In the simulation including nonlinear effects, example 2, we simulate a standard 64-QAM, and the example applies cfh.staticPhaseRotationCompensation, which blindly compensates the static phase rotation so that the 64-QAM constellation is upright. The function cfh.testPhases then tests all 4 different "uprights" by minimizing the symbol error over the 4 possibilities. One could also have calculated the static phase rotation via the cross-correlation of txSymbols and rxSymbols and taken the average of that (or something along those lines); however, back then I thought the blind phase rotation compensation algorithm (by one of my peers) was quite interesting to play around with.

Now, in example 4, the neural network at the receiver takes the static phase rotation into account, as it simply learns the rotated constellation of the rxSymbols, leading to a correct Mutual Information estimate from the neural network output probabilities, nnMI. However, the Mutual Information estimated under the Gaussian channel assumption, gaussMI, is incorrect, as the static phase rotation is not corrected for. This also explains its bad reported performance, i.e., the green x in the last plot of example 4, pasted here for the reader's convenience.

[performance plot]

I never got to it, but I wanted to correct for the static phase rotation during the training of example 4, so that the gaussMI estimate would be calculated correctly throughout training. Following that thought: cfh.staticPhaseRotationCompensation assumes a QAM, which I planned to accommodate by introducing nPilots pilot symbols from a pilotOrder=4 QAM constellation, so that throughout training the static phase rotation would be estimated on those pilot symbols and used to compensate the received symbols of the learned constellation.
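A hedged sketch of what that pilot-aided compensation could look like. The function name and interface here are made up for illustration; this is not code from the repository:

```python
import numpy as np

def pilot_phase_compensation(tx_pilots, rx_pilots, rx_payload):
    """Estimate the static phase rotation from known pilot symbols and
    derotate the payload (sketch of the planned nPilots-based scheme).
    Averaging conj(tx)*rx in the complex domain before taking the angle
    makes the estimate robust to additive noise."""
    phase = np.angle(np.mean(np.conj(tx_pilots) * rx_pilots))
    return rx_payload * np.exp(-1j * phase), phase
```

Because the pilots come from a fixed QPSK alphabet, the estimate does not depend on the learned constellation, which changes every training iteration.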

Somewhat lengthy answer, but I hope it makes sense. Let me know if you have any follow-up questions.

FrankTian0012 commented 8 months ago

Thank you for your answer! It helps a lot. But based on your answer, I have some other questions: Why not just apply cfh.staticPhaseRotationCompensation and cfh.testPhases to the original symbols? And what is the significance of the Mutual Information estimated under the Gaussian channel assumption, gaussMI, that needs to be calculated correctly? Doesn't the Gaussian channel assumption fail to hold in SSFM, and what is the difference between it and nnMI?

Rassibassi commented 8 months ago

Before I delve into the specifics, I want to highlight that these topics are quite extensive and multifaceted. My explanation will provide a high-level overview and will touch on the key aspects, but it might not cover all the intricate details or the full scope of available research. And of course, feel free to ask follow-up questions if you need more clarity on any part.

Why not just do cfh.staticPhaseRotationCompensation and cfh.testPhases on the original symbols?

Carrier phase estimation algorithms often presuppose a QAM constellation, such as seen in Viterbi-Viterbi carrier phase estimation methods. My understanding is that the carrier phase estimation in cfh.staticPhaseRotationCompensation is more effective with QAM constellations. Therefore, incorporating QAM pilots could leverage this efficiency. Notably, during training, the neural network's generated constellation - the basis for the original symbols - evolves with each training iteration. This variability can be problematic for downstream carrier phase estimation algorithms that rely on the statistical and geometrical consistency of the constellation. However, it's important to note that in this context, we're compensating for static phase rotation, not dynamic phase rotation like a random walk, suggesting that more refined solutions might exist.
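As a reference point for why such estimators presuppose a QAM-like geometry, the classic Viterbi-Viterbi estimator for a constellation with 4-fold rotational symmetry fits in a few lines. Note the inherent pi/2 ambiguity, which is exactly the kind of residual that a test over the four "uprights" then resolves (this is my own sketch, not the algorithm inside cfh.staticPhaseRotationCompensation):

```python
import numpy as np

def viterbi_viterbi_phase(rx_symbols):
    """Viterbi-Viterbi estimate for a constellation at angles k*pi/2
    (e.g. {1, 1j, -1, -1j}): raising to the 4th power removes the data
    modulation, leaving 4x the common phase offset. The result is only
    defined modulo pi/2."""
    return np.angle(np.mean(rx_symbols ** 4)) / 4.0
```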

In developing cfh.staticPhaseRotationCompensation, I explored whether TensorFlow's auto-differentiation could handle gradient propagation through the eigenvalue/eigenvector decomposition it relies on. To my recollection, it can. Modifying cfh.testPhases to employ the Gumbel-max trick instead of tf.argmin could render both cfh.staticPhaseRotationCompensation and cfh.testPhases fully differentiable, akin to the approach in [1].

[1] https://ieeexplore.ieee.org/abstract/document/10093964
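A rough numpy illustration of the Gumbel trick as a soft replacement for argmin. This is illustrative only; a real TensorFlow version would keep the softmax weights in the graph so gradients flow through the phase selection:

```python
import numpy as np

def gumbel_softmax_select(costs, temperature=1.0, rng=None):
    """Soft replacement for argmin over candidate costs: add Gumbel noise,
    then softmax over the negated, noise-perturbed costs. As temperature
    approaches 0 the weights approach a one-hot selection of the argmin
    (in expectation, the Gumbel-max sampling distribution)."""
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=costs.shape)))
    logits = (-costs + gumbel) / temperature
    w = np.exp(logits - logits.max())  # stable softmax
    return w / w.sum()                 # soft one-hot weights over candidates
```

In cfh.testPhases, the symbol-error counts of the four rotations would play the role of `costs`, and the compensated output would be the weight-averaged candidate instead of a hard tf.argmin pick.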

What is the significance of the Mutual Information estimated under the Gaussian channel assumption, gaussMI, that needs to be calculated correctly? Doesn't the Gaussian channel assumption fail to hold in SSFM, and what is the difference between it and nnMI?

The estimation of Mutual Information (MI) or, more precisely, the Achievable Information Rate (AIR), commonly hinges on a Gaussian channel assumption in academic circles. This assumption, while not entirely accurate for fiber channels, could potentially be valuable for a comparative analysis.

Although the Gaussian channel assumption doesn’t directly apply to fiber channels, it's useful to note that under any channel assumption, the real MI is lower-bounded by the MI estimated with a mismatched receiver based on that assumption. This means that even if the channel isn't Gaussian, using a Gaussian channel assumption for the receiver provides a conservative estimate of the actual MI. This concept underpins the term "achievable" in AIR. For a deeper dive into this, I recommend Tobias Fehenberger’s papers and Gerhard Kramer’s lecture notes from TUM (search for "Information Theory Lecture Notes Kramer"). I'd be happy to provide more detailed references if you're interested in exploring this further.
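To make the mismatched-receiver idea concrete, here is a small sketch of a gaussMI-style estimate on a toy AWGN/QPSK channel. This is my own toy code, not the repository's implementation:

```python
import numpy as np

def gauss_mi_estimate(tx, rx, constellation, noise_var):
    """MI estimate with a mismatched Gaussian receiver,
    q(y|x) ~ exp(-|y - x|^2 / noise_var), assuming uniform inputs:
    MI ~ E[ log2( M * q(y|x) / sum_x' q(y|x') ) ].
    On any channel, this mismatched estimate lower-bounds the true MI,
    which is what makes the rate 'achievable'."""
    d2 = np.abs(rx[:, None] - constellation[None, :]) ** 2
    q = np.exp(-d2 / noise_var)                     # q(y|x') for all x'
    qx = np.exp(-np.abs(rx - tx) ** 2 / noise_var)  # q(y | transmitted x)
    post = qx / q.sum(axis=1) * len(constellation)
    return np.mean(np.log2(post))
```

When the channel really is Gaussian (as in the test below), the receiver is matched and the estimate approaches the true MI; on a fiber channel it stays a conservative lower bound.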

The key difference between nnMI and gaussMI lies in their respective approaches to calculating P(X|Y) in the channel model Y=X+N. nnMI utilizes a decoder neural network, with its softmax layer producing a probability vector as output. Conversely, gaussMI employs the Gaussian assumption to derive a similar probability vector. My thesis (see section 3.4 at [3]) briefly touches on the intuitive distinctions between these approaches. In short, nnMI is inherently biased towards the training samples, particularly those from the last iteration. Under an AWGN channel with infinite training samples (infinite training samples per batch!), the neural network decoder should theoretically align with a Gaussian-channel-assumed receiver. In contrast, for a fiber channel, the neural network decoder should converge towards a different model. Yet, most accurate fiber channel models (like EGN or the NLIN model) use Gaussian approximations to represent nonlinear channel effects. This suggests that in fiber scenarios, the neural network decoder should converge towards a slightly different model than a Gaussian-channel-assumed receiver. This "slightly" different model, which the neural network decoder should ideally converge towards in fiber scenarios, is precisely the one that would define the nonlinear Shannon limit, if we had a comprehensive understanding and formulation of it.

[3] https://backend.orbit.dtu.dk/ws/portalfiles/portal/178555155/rasmusJonesThesis_v12_revised_print.pdf

FrankTian0012 commented 8 months ago

Thank you very much for your clear answers; they help me a lot in understanding the whole system. Really appreciate it, thank you! Now I have some other questions relating to examples/tf_wdmSystem.html (link as rendered HTML). First, I see the errorrate during the training process is 0.98, which is not good. Does it mean the result of the learned model is not significant? The other question is that the effSNR during training is negative. Why is that? And during the performance evaluation, the code uses a GN model to calculate the effSNR; why do we do it this way instead of directly using the previous effSNR?

Rassibassi commented 7 months ago

It seems there has been a confusion regarding the version of the system you were examining. The appropriate version, which includes active training, can be found here:

examples/tf_wdmSystem-learning.html View the rendered HTML

In this specified version, the training metrics reported are as follows:

epoch: 0150 - xentropy: 0.6471 - errorrate: 0.9987 - gaussMI: -0.7214 - nnMI: 5.0665 - effSNR: -5.4651

These particular metrics (errorrate, gaussMI, and effSNR) indeed signal suboptimal performance. The primary reason is the omission of static phase rotation compensation during training; this was an oversight. If static phase rotation compensation had been applied, we would expect significantly improved metrics, with the effSNR moving into the positive range and aligning more closely with the nnMI performance. The nnMI is not affected by the static phase rotation, since the decoder neural network learns to account for it as part of the channel characteristics.

See my answer above where I state:

However, the estimated Mutual information under Gaussian Channel assumption gaussMI is incorrect as the static phase rotation is not corrected for.
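A toy demonstration (my own, with made-up noise and rotation numbers) of how an uncompensated static rotation collapses an error-vector-based effSNR, while ideal compensation restores it:

```python
import numpy as np

rng = np.random.default_rng(1)
const = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))  # unit-energy QPSK
tx = rng.choice(const, 10000)
noise = np.sqrt(0.005) * (rng.standard_normal(10000) + 1j * rng.standard_normal(10000))
rx = tx * np.exp(1j * 0.5) + noise  # 0.5 rad static rotation, e.g. from the fiber

def eff_snr_db(x, y):
    # effective SNR from the residual error vector (one common definition)
    return 10 * np.log10(np.mean(np.abs(x) ** 2) / np.mean(np.abs(y - x) ** 2))

before = eff_snr_db(tx, rx)                       # rotation counted as "noise"
after = eff_snr_db(tx, rx * np.exp(-1j * 0.5))    # ideal compensation
```

The rotated constellation sits far from the reference symbols, so the error vector is dominated by the rotation rather than by actual noise; the learned decoder (nnMI) never sees this penalty because it models the rotated constellation directly.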

For the performance evaluation, the training is confined to a specific launch power, providing only a single data point for analysis. To contextualize this data point against a broader range of conditions and offer a visually coherent comparison, we use the NLIN model to calculate the effSNR, and from it the MI, for both the learned constellation and a QAM constellation. This lets us plot a smooth line representing the model's performance for the learned constellation, and also shows that the training setting, in particular the launch power, lies in the nonlinear regime.
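As a toy illustration of such a sweep, with entirely made-up coefficients (not the NLIN model parameters from the repository): an ASE noise term plus a nonlinear interference term growing as P^3 yields the familiar bell-shaped effSNR over launch power, with a distinct nonlinear regime past the optimum:

```python
import numpy as np

# GN-model-style effective SNR over launch power: linear (ASE) noise plus a
# nonlinear interference term ~ P^3. sigma_ase and eta are illustrative
# constants in arbitrary units, chosen only to show the shape of the curve.
sigma_ase, eta = 0.05, 0.02
p = np.linspace(0.1, 5.0, 50)        # launch power sweep
eff_snr = p / (sigma_ase + eta * p ** 3)
mi = np.log2(1 + eff_snr)            # Gaussian-capacity proxy for the MI
p_opt = p[np.argmax(eff_snr)]        # optimum launch power on the grid
```

Analytically this toy model peaks at P_opt = (sigma_ase / (2 * eta))^(1/3); training at a launch power above that point places the system in the nonlinear regime, which is where shaping against nonlinear effects matters.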

FrankTian0012 commented 7 months ago

Thank you very much! Sorry, one more question about the performance evaluation: why not just train separately on a series of launch powers, thus getting a line of effSNR or MI over launch power? Is it because it would take a long time?

Rassibassi commented 7 months ago

Yes, due to limitations in computational resources and the limited personal time I could dedicate post-PhD.

FrankTian0012 commented 7 months ago

Yes, due to limitations in computational resources and the limited personal time I could dedicate post-PhD

Yeah, thank you very much. Really appreciate that. You have helped me a lot in understanding this!

FrankTian0012 commented 7 months ago

Hi Rasmus~ I'm sorry to bother you again. I now have another question, about the symbol error rate and bit error rate shown in your simulations. Take GS_GMI for example: it seems the BER is too high, as the BER or SER should be lower than 10^-4 for the system to be correct or meaningful. But in your examples, BER=0.03 in GS_GMI. Do you know how to improve the model so that the BER drops to a quite low level? Thank you very much!

Rassibassi commented 7 months ago

If you say meaningful, do you mean the error rate is too high for an actually robust transmission? The simulated transmission distance is quite far for a constellation order of 64, so the high pre-FEC error rate is expected; you could, for example, shorten the link or lower the constellation order.

However, GMI is the metric you should be interested in: https://ieeexplore.ieee.org/abstract/document/7138570/

FrankTian0012 commented 7 months ago

Thank you very much. Yes, GMI should be the metric. Can I understand it this way: we focus on GMI as a metric because, even though the pre-FEC BER is high in the simulation mentioned earlier, we assume that the post-FEC BER is quite low since the GMI has been optimized? However, since our ultimate goal is to achieve error-free transmission, how do we know in what range the GMI meets the requirements? Taking the simulation mentioned before as an example, is GMI = 5.37 a reasonable result?

Rassibassi commented 6 months ago

If we're aiming for a system that can be deployed, focusing on error rates is crucial. However, if our goal is to demonstrate state-of-the-art shaping gains in both linear and nonlinear channels, we need to consider a different approach. For a 64-QAM system capable of robust transmission, we'd target a GMI close to 6.0 bits per symbol, which is the upper limit. Choosing a system setup that already operates near this limit leaves little room to showcase shaping gains. The novelty of our study lies in demonstrating these gains with an autoencoder-like ML method, especially constellation shaping that accounts for nonlinear effects.

Regarding the discussion on pre-FEC and post-FEC BERs, it's important to note the challenges in simulating post-FEC BERs due to the extremely low BER requirements (e.g., 10^-12 to 10^-15) for reliable transmission. This has led to the reliance on pre-FEC BERs for practical reasons. However, as shown in Alvarado et al. [1], pre-FEC BERs, derived from hard decision demapping, are not accurate predictors of post-FEC BERs. In contrast, GMI, calculated from soft per bit log-likelihood values before decoding, has proven to be a reliable indicator of post-FEC performance.

[1] https://ieeexplore.ieee.org/abstract/document/7138570/

Actually, we are optimizing the bitwise cross entropy, not the GMI directly. One can probably show that the GMI is lower-bounded via the bitwise cross entropy, so the GMI is optimized indirectly. For evaluation we focus on the GMI, as it is the best predictor of the post-FEC BER and, by extension, overall system performance. The choice of system setup, leading to a rather low GMI of 5.37 bits per symbol, is then due to being able to present a shaping gain.
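The GMI-to-cross-entropy relation can be sketched as follows. With per-bit posterior LLRs, the GMI equals m minus the sum of the per-bit binary cross entropies in bits, which is exactly why minimizing bitwise cross entropy pushes the GMI up (toy code with my own conventions, not the repository's evaluation code):

```python
import numpy as np

def gmi_from_llrs(tx_bits, llrs):
    """GMI estimate from per-bit LLRs, L = log p(b=0|y) - log p(b=1|y).
    tx_bits, llrs: arrays of shape (n_symbols, m), bits in {0, 1}.
    Since p(b|y) = 1 / (1 + exp(-(1 - 2b) L)), the per-bit cross entropy
    is log2(1 + exp(-(1 - 2b) L)), and GMI = m - sum of those entropies."""
    sign = 1.0 - 2.0 * tx_bits                  # +1 for bit 0, -1 for bit 1
    bce = np.log2(1.0 + np.exp(-sign * llrs))   # per-bit cross entropy [bits]
    m = tx_bits.shape[1]
    return m - bce.sum(axis=1).mean()
```

A confident, correct LLR contributes almost zero cross entropy; a confident, wrong one contributes a large penalty, so the GMI directly reflects the quality of the soft information handed to the FEC decoder.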

FrankTian0012 commented 6 months ago

Thank you very much! And apologies for responding late. I asked silly questions; now I get it. Also, when I did the simulation with PS in the NLIN model, I noticed that temperature=10 for the Gumbel softmax seems to give better performance. Anyway, thank you very much. You are really patient and nice.