Open SxN02 opened 3 years ago
This would not apply to packet data, as the language that it's in is, from the point of view of pcapng, called "raw binary".
Some data might happen to be text, but that data might carry its own language tags, such as HTML language tags. Those tags, not any tag in the capture file, should indicate the language from which to translate.
So this would apply only to data defined as text in the pcapng specification itself. Thus, it would currently apply to the opt_comment, custom string, shb_hardware, shb_os, shb_userappl, if_description, if_os, and if_hardware options. It would not apply to:
Note also that there is no guarantee that all options in a block are in the same language; you might have an interface whose description was written in Simplified Chinese, with a hardware description in Traditional Chinese, about which comments in the capture file have been written in English and Russian.
So this might take the form of an option that, if it appears before another option, indicates the language of that option (doing nothing if the option is not one that's in a language). I.e., it's a non-locking shift; a locking shift is another possibility.
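To make the locking versus non-locking distinction concrete, here is a minimal sketch of how a reader might pair options with language tags. It is only an illustration under stated assumptions: the option name `opt_language` is hypothetical (pcapng defines no such option), options are modelled as simple `(name, value)` pairs rather than real encoded options, and the set of text options is the list given above.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical name for the proposed language option; not part of pcapng.
LANG_OPTION = "opt_language"

# Options the discussion above lists as being defined as text.
TEXT_OPTIONS = {
    "opt_comment", "custom string", "shb_hardware", "shb_os",
    "shb_userappl", "if_description", "if_os", "if_hardware",
}

Option = Tuple[str, object]
Tagged = Tuple[str, object, Optional[str]]


def tag_options(options: Iterable[Option], locking: bool = False) -> List[Tagged]:
    """Attach to each option the language tag that applies to it, if any.

    Non-locking: a language option affects only the option right after it.
    Locking: a language option stays in effect until the next language option.
    Options that are not text never receive a tag.
    """
    current: Optional[str] = None
    tagged: List[Tagged] = []
    for name, value in options:
        if name == LANG_OPTION:
            current = str(value)  # remember the shift, emit nothing
            continue
        lang = current if name in TEXT_OPTIONS else None
        tagged.append((name, value, lang))
        if not locking:
            current = None  # the shift is consumed by this one option
    return tagged


example = [
    (LANG_OPTION, "zh-Hans"), ("if_description", "desc"),
    (LANG_OPTION, "zh-Hant"), ("if_hardware", "hw"),
    ("opt_comment", "untagged in non-locking mode"),
    ("if_tsresol", 6),
]
non_locking = tag_options(example)
locked = tag_options(example, locking=True)
```

With the non-locking reading, the trailing `opt_comment` carries no language; with the locking reading it inherits `zh-Hant` from the last language option. In both cases the non-text `if_tsresol` option is left untagged.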
As for which text fields would be good candidates for localization and which would not, as far as I understand this format, I agree with the list above. Where I think we see it differently is in the language declaration via an HTML tag, which is useful in the complex example provided, but otherwise largely optional. Having a language declaration as a field would force writers to honour it accordingly, so, what a Canadian writer honours, an American reader can render as "honors", reliably, on the fly. HTML tags, of course, can override the field.
> Where I think we see it differently is in the language declaration via an HTML tag, which is useful in the complex example provided, but otherwise largely optional.
I didn't propose HTML tags for anything other than HTML data in packets in the capture. All I noted there is that it makes no sense to have a language tag for packet data, as the packet data is what it is. Either the language is identified in the data, in which case that's what an application should use if it translates HTML text in packets, or it's not identified there, in which case it's not clear how options in the packet could identify it, especially given that a given Web page in a capture might be in more than one language.
HTML tags simply wouldn't apply to text options in a pcapng block; there's no HTML there to even contain them, unless there's HTML in the capture, and even there, whatever random Web traffic you might have captured should not have any effect on a pcapng reader's notion of what language a comment attached to a packet is in.
I would like to suggest an addition to pcapng, in the form of an IETF language tag, where it is applicable. At first I thought of associating it with the opt_comment field (section 3.5), but then I realized that it may add value elsewhere too, so perhaps it should be part of block headers, wherever it is relevant to a block.
The reason for this addition is to identify the original language, giving applications that render pcapng the information they need for on-the-fly translation. Language tags can be recognized as strings that start with a letter, end with a letter and, optionally, have letters and/or dashes in between, but no consecutive dashes. A "smart" application may be capable of indexing and searching in both the original language and the target language.
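The recognition rule described above can be sketched as a regular expression. This is only a rough illustration of that rule, not a full language-tag parser, and the function name `looks_like_language_tag` is made up for the example. As far as I know, the IETF language tag syntax (BCP 47) also allows digits in subtags, as in `es-419`, so a real reader would probably need a fuller grammar than this.

```python
import re

# One or more letters, then zero or more groups of "-" plus letters.
# This enforces: starts with a letter, ends with a letter, no "--".
SIMPLE_TAG = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def looks_like_language_tag(text: str) -> bool:
    """Return True if text fits the simplified rule described in this issue."""
    return SIMPLE_TAG.fullmatch(text) is not None


candidates = ["en", "en-CA", "zh-Hant", "en--CA", "-en", "en-", "es-419"]
results = {tag: looks_like_language_tag(tag) for tag in candidates}
```

Under this simplified rule, `en`, `en-CA` and `zh-Hant` are accepted, while `en--CA`, `-en`, `en-` and the digit-bearing `es-419` are rejected.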
I believe it is a very small addition that would nonetheless make an important contribution to future-proofing the format. Please share your thoughts on it.