Tcp streams with hektor-fsm

Sebitosh commented 6 months ago

Hello,

This is my work in progress for aboutsip/pkts#147 , I am making a draft PR because the additions are starting to become a bit much (and because I was quite slow at making them).

At first I was going for a "standard OOP" approach for the Tcp stream logic, but since the comment on the issue I looked at the hektor library and now I am trying to use it to handle the stream logic. The biggest steps now are linking the FSM to the TcpStreamHandler and of course making tests for all of this. I also looked at the snice ConnectionId, but it looked a bit complicated for the usage I have here, so I worked out a TransportStreamId. This still has my attention for the coming days (as I hope to finish this work in not so long), and any feedback is appreciated (at this stage especially on the FSM, I need to document it of course but here I am kind of hoping I understood and used the hektor-fsm library correctly). I also spend a lot of time understanding the libraries and defining the streams exactly. I should be faster at producing code from now on.

From here I still need to:

[x] Make tests for the FSM
[x] Refactor some of my initial work around the use of the tcp FSM
[x] Make the streamhandler use the TcpStreamFSM for it's processFrame function
[x] Make tests for the different data structures (transportstreamid, tcpstream, ...)
[x] Document code (except duplicate and stream handler)
[x] Cleanup code (except duplicate and stream handler)
[x] Make end to end tests interpreting pcap files and cheking if they correctly organize the TCP streams for general cases
[x] Make end to end tests for corner cases with synthetic pcaps where the 5-tuple of a stream is reused for multiple connections
[x] Correct and Define the logic for the handling of the corner cases
[x] Make example file
[x] Last clean/doc before review

Sebitosh commented 5 months ago

Hey, sorry for late review. I just focused on the FSM and overall, seems that you're getting the idea of it. I just had a few comments. Looking forward to the "full" pull request, at which point I will look through it all in more detail.

Thank you for the feedback ! I got a bit delayed but I am working on it this week and next week, I hope until completion.

Sebitosh commented 5 months ago

I am currently in the process of trying out classifying different TCP streams. I noticed an interesting difference with how Wireshark handles streams compared to my code: when comparing the classifying from a complex pcap file, my code may have more streams than Wireshark.

This is due to how RST packets are handled: I consider the presence of a RST packet as a closing of the stream, considering that any packet with the same identifier must belong to a whole new stream a never to a closed stream. Wireshark on the other hand considers that packets from the same stream may arrive after RST packets.

I am unsure on how Wireshark computes that, I would guess by keeping track of both last arrived sequence numbers of the stream and never considering that a stream can not receive any new packets anymore (which I do, I consider that "closed" streams do not receive packets anymore). Should I try to stick as close as possible to how the Wireshark classifier behaves ?

I'll leave this question for now and focus on unit testing.

Edit: the hole in my reasonning here is that TCP packets with special flags could be duplictates or retransmission, and that using the sequence numbers is a good way of knowing if it is a duplicate (the chance of the sequence numbers colluding between streams like this is very small). I am reading up some comments in the wireshark code to understanc their logic and then implement it here.

Sebitosh commented 5 months ago

Edit: the hole in my reasonning here is that TCP packets with special flags could be duplictates or retransmission, and that using the sequence numbers is a good way of knowing if it is a duplicate (the chance of the sequence numbers colluding between streams like this is very small). I am reading up some comments in the wireshark code to understand their logic and then implement it here.

I still have some testing to do has I am still miss placing some packets in un-necessary new streams in some case(s) I still have not identified, but I am getting there with the last 2 commits

jonbo372 commented 5 months ago

Thank you for all your work on this so far! It looks to be coming along nicely!

And yeah, figuring out all the various cases with re-transmissions and how to deal with it can be quite a lot of work. I built a recording app for audio streams coming in over RTP once and it probably took a few years of heavy use in production before I felt I had found "all" cases and closed the loop on those. Reading wireshark code is probably a good idea as it has had a lot of eyes on it over the years.

I just had a quick look over your code but will refrain from a full code review until you feel that you are 100% ready for one.

Btw, the pcaps you've checked in, just make sure they don't contain any sensitive information (i haven't look so perhaps you've already made sure of this).

Again, thank you!

Sebitosh commented 4 months ago

Thank you for all your work on this so far! It looks to be coming along nicely!

And yeah, figuring out all the various cases with re-transmissions and how to deal with it can be quite a lot of work. I built a recording app for audio streams coming in over RTP once and it probably took a few years of heavy use in production before I felt I had found "all" cases and closed the loop on those. Reading wireshark code is probably a good idea as it has had a lot of eyes on it over the years.

I just had a quick look over your code but will refrain from a full code review until you feel that you are 100% ready for one.

Btw, the pcaps you've checked in, just make sure they don't contain any sensitive information (i haven't look so perhaps you've already made sure of this).

Again, thank you!

For now I still have issues with TCP Keep Alives beeing exchanged after a connection closes and their is still some issue with how I keep track of the different seq/ack numbers, so I still have some debugging to do.

The pcaps should not contain sensitive info. Most of them are synthetically generated, the biggest one is from some basic web browsing to popular websites where I did not login to anything in particular.

I am a bit busy at the moment (lots of university projects to turn in, and I've got some finals to work on after that), but as soon as I can I will continue to find what is wrong and clean up the code.

Sebitosh commented 2 months ago

Hello, after my exams and some health issues, I am back working on this PR again !

Here I pushed my first full attempt at handling duplicates and other cases of packets belonging to already closed streams. It classifies packets in the exact same streams as wireshark does for my example capture with 272 streams. However, the logic to handle those cases is certainly not complete yet. I will return to it later to focus first on making the rest of the code ready for review.

In the coming days I will focus on cleaning and documenting the code, and I will produce some small tests for the new objects along with an example file to show how to use the streams. After that I will inspect further the corner cases of the duplicate handler.

The logic I was working on is in tcpDuplicateHandler. It takes care of the following cases were a packet belongs to a stream that seems to have already been closed:

1. If the received packet corresponds to a duplicate of one of the signaling packets closing the connection (RST, FIN, or ACK of FIN) (Edit: testing further I see that specifically looking for those cases might be unnecessary, for the weird cases later may include things like TCP clients sending multiple RST packets with different sequence numbers and yet belonging to the same stream)
1. If the received packet corresponds to a packet that is out of order but still within the expected sequence numbers
1. If the received packet corresponds to a packet that is in sequence with the last packet of the stream, which happens when one or both of the TCP clients did not understand the connection is closed (and this is were things can get weird )

jonbo372 commented 1 month ago

Hey! I see the pull requests coming through, very exciting and again, thanks for all your hard work.

I just wanted to let you know that I'm currently out traveling so will not get to this until second week in August, in case you're wondering why you don't hear anything from me.

Looking forward to it!

Sebitosh commented 1 month ago

Hey! I see the pull requests coming through, very exciting and again, thanks for all your hard work.

I just wanted to let you know that I'm currently out traveling so will not get to this until second week in August, in case you're wondering why you don't hear anything from me.

Looking forward to it!

No worries, have a good vacation !

I'll finish up today (I only have some cleanup left to do before it's ready for review). I will also write a post with my conclusions on the corner cases to explain the behavior of the handler.

Sebitosh commented 1 month ago

To sum up how I handle the special cases of a connection having it's 5-tuple reused in the same file:

After some investigation, Wireshark does not use sequence numbers analysis nor does it use timestamps to classify connections using the same 5-tuple into different streams. In fact, it only splits a stream with the same 5-tuple in one case: when a new syn packet has been identified on the stream. In that case, it will consider any new packets to belong to that new stream, whatever the flags or sequence numbers. So, instead of continuing on with seq/ack analysis, I rolled back and modified the FSM to reflect the fact that the terminal state of a TCP connection for the handler is when the 5-tuple is reused by a new syn packet. This eliminates the problem of finding out what packets to attribute packets to a closed connection (which is great considering that in captured traffic their is actually a lot of weird situations where tcp clients will exchange packets even when the connection is expected to be closed).

It does however raise the question of when should the handler call the endStream function of the streamListener. For now, it calls it if after adding a packet if the state is closed or closed_ports_reused, meaning it will be called each time a packet is added to a closed connection. This contradicts the comment I found of the endStream function that says the function should be called only once to not create confusions when writing out the stream. However, the "easy" alternatives to this are calling endStream only when ports are reused or only once when the stream reaches the closed state and never after despite adding packets to it. The writing problem the endStream documentation mentions should not be a problem here as writing out a DefaultTcpStream is still unsupported though. Edit: I read the other stream handlers, figured out it would probably be best to call endStream like the SipHandler, that is only once when we first see a packet closing the stream, and never after that.

I left out implementing a timeout mechanism as Wireshark does not split streams like that (even if packets have arrival times years apart) and because I am unsure on how to do it. For the endStream call I did not find how to call it when the file is over, and I am unsure if it is necessary to do so. Important to mention that the problem with calling endStream multiple times concerns only a minority of streams I saw in my captures. I generated synthetic traffic with all the corner cases I examined and used them in unit tests to see if the handler would classify them in the same way as Wireshark (when Wireshark splits a stream because of a new syn packet, it will mark that new syn with [Ports Reused]). Important also is that MPTCP traffic will not be recognized and put together.

With all this said, I believe my work to be ready for review. I'll be there in August to incorporate all the feedback !

aboutsip / pkts

Tcp streams with hektor-fsm #148