eclipse-californium / californium

CoAP/DTLS Java Implementation
https://www.eclipse.org/californium/
Other
729 stars 367 forks source link

Issues with parsing CoRE Link Format #2274

Open v-po opened 2 months ago

v-po commented 2 months ago

I have encountered some issues with the LinkFormat.parse() method when parsing certain payloads. I realize the current implementation is very simple. And maybe the following points are not even considered bugs, in which case feel free to close this issue. I'm also hoping that I have not completely misunderstood the main RFC 6690, or any of the linked ones :sweat_smile:.

The current parser seems to have the following problems:

1. Incorrect handling of certain characters in Attribute Values

Parsing stops when encountering certain characters that are valid in relation-types. Specifically, it fails when handling characters such as dots (.), hyphens (-), and colons (:).

var parsed = LinkFormat.parse("</some/uri>;rt=some.resource-type;if=some.interface-description:v0.13");
System.out.println(parsed);

Expected output:

[</some/uri>
    rt: some.resource-type
    if: some.interface-description:v0.13]

Actual output:

[</some/uri>
    rt: some]

It appears that the issue is caused by the regex pattern used for WORD.

public static final Pattern WORD = Pattern.compile("\\w+");

However, I'm not sure if simply extending this regex to include those characters will fix it. It might even make it worse :)

2. Incorrect handling of quoted strings for non-relation-type attributes

The current parser splits all quoted strings into separate values when encountering spaces, regardless of the attribute type. If I understood correctly, only specific attributes (such as rel, rev, rt, and if) should have their quoted strings split into space-separated values. All other attributes should treat quoted strings as single entities, even if they contain spaces.

var parsed = LinkFormat.parse("</some/uri>;b=\"abc def\"");
System.out.println(parsed);

Expected output:

[</some/uri>
    b: abc def]

Actual output:

[</some/uri>
    b: [abc, def]]

3. (Maybe issue) Handling of attributes without values (e.g., observability flags)

I decided to represent them as the empty list, instead of the list containing the empty string as californium treats them, since an attribute like a="" seemed distinct than just a

I'm sure there are other issues with the parser (for example I didn't see any handling of ext-value) as the whole specification is (imo) a giant mess of interlinking RFCs. These are just the ones that I came across.

My own solution

I don't know if this is of value to anyone else, but this is the code that I'm currently using to using to get around these issues. It's nowhere near "RFC-compliant" and also not very efficient with the String allocations.

Core Links parser in Kotlin ```kotlin typealias AttributeName = String typealias AttributeValues = List data class CoreLinks(val links: Map) data class CoreLink(val uri: URI, val attributes: Map) fun parseCoreLinkFormat(input: String): CoreLinks { val linksMap = mutableMapOf() val entries = splitLinkValueList(input) for (entry in entries) { val uriReference = extractUriReference(entry) ?: continue val attributes = extractLinkParams(entry.substringAfter(">").trimStart()) linksMap[uriReference] = CoreLink(uriReference, attributes) } return CoreLinks(linksMap) } private fun splitLinkValueList(input: String): List { val entries = mutableListOf() val currentEntry = StringBuilder() var insideUri = false var insideQuote = false for (char in input) { when (char) { ',' -> { if (!insideUri && !insideQuote) { // If we encounter a comma and are not inside a URI or quoted string, // treat it as a delimiter for splitting link-value entries entries.add(currentEntry.toString().trim()) // Add the current entry to the list currentEntry.clear() // Clear the StringBuilder for the next entry } else { // If inside URI or quoted string, treat the comma as part of the entry currentEntry.append(char) } } '<' -> { // Mark the start of a URI-Reference (inside <...>) insideUri = true currentEntry.append(char) } '>' -> { // Mark the end of a URI-Reference insideUri = false currentEntry.append(char) } '"' -> { // Toggle the insideQuote flag when encountering a quote character insideQuote = !insideQuote currentEntry.append(char) } else -> { // For all other characters, just append them to the current entry currentEntry.append(char) } } } if (currentEntry.isNotEmpty()) { entries.add(currentEntry.toString().trim()) } return entries } private fun extractUriReference(entry: String): URI? { val start = entry.indexOf('<') val end = entry.indexOf('>') if (start == -1 || end == -1 || end < start) { println("URI-Reference incorrectly formatted: $entry") return null } return try { URI(entry.substring(start + 1, end)) } catch (e: Exception) { println("Error constructing URI: ${e.message}") null // Handle invalid URIs } } private fun extractLinkParams(paramsPart: String): Map> { val linkParams = mutableMapOf() val parts = paramsPart.split(";").map { it.trim() } if (parts.isEmpty()) return linkParams // We iterate parts from index 1 because paramsPart starts with a ";", so index 0 will be an empty string after split for (i in 1 until parts.size) { val attrParts = parts[i].split("=", limit = 2).map { it.trim() } if (attrParts.size == 2) { val key = attrParts[0] val value = attrParts[1] val values = when { value.startsWith("\"") && value.endsWith("\"") -> { // Handle "quoted strings" for relation-types if (key == RELATION || key == REVERSE_RELATION || key == RESOURCE_TYPE || key == INTERFACE_DESCRIPTION) { value.removeSurrounding("\"").split(" ") } else { listOf(value.removeSurrounding("\"")) } } else -> listOf(value) } linkParams[key] = values } else if (attrParts.size == 1) { // Handle attributes without values (e.g., observability flags) linkParams[attrParts[0]] = listOf() } } return linkParams } ```
boaks commented 2 months ago

To be frank, I'm no expert for RFC 6690. Over the years not too many seems to use that extensively. What's is your use-case, which is affected by the misbehavior?

v-po commented 2 months ago

@boaks Not entirely sure what you mean by use-case, but we are using parts of californium in an Android app and in a Resource Directory backend. And we have IoT devices that send data in the formats outlined above (when querying "/.well-known/core") which caused problems when we tried using the LinkFormat parser. For us it's not a very big issue, as we don't use the californium CoapClient, so we can get by with our own Core Links parser. I guess it will be a problem for someone who uses CoapClient.discover() and receives data like the one specified above.

boaks commented 2 months ago

Not entirely sure what you mean by use-case

"Resource Directory" is the kind of answer I was looking for.

For us it's not a very big issue

OK.

If you're interested to get answers for your points and interpretations from the IETF core-group, the easiest way is to copy the issue into corrclar issue.

We can alternatively convert your solution into a PR and see, if someone has issues with it.

(I'm personally on the way to ramp down my engagement to a hobby/free time level. Therefore the handling of the clarification on my side will take some time.)

sbernard31 commented 2 months ago

On my side, I never used Californium classes to parse Link Format. But I faced this RFC 6690 with LWM2M and I also feel this is a not so easy to implement RFC. (especially because of extensibility ...)

In Leshan we try to implement a CoRE Link Parser, all classes are in leshan-lwm2m-core@org/eclipse/leshan/core/link/