flathub-infra / flatpak-external-data-checker

A tool for checking if the external data used in Flatpak manifests is still up to date
GNU General Public License v2.0

htmlchecker: allow specifying error handling on encoding error #410

Closed: Bixilon closed this 1 month ago

Bixilon commented 5 months ago

I need to fetch a binary file that contains the update name. The file is binary and only contains a few readable strings. The problem is that the check already fails with "Error querying for new versions: 'utf-8' codec can't decode bytes in position", so I am not able to apply any regex to it. With this change I can set an attribute called encoding-error to ignore. This is kinda hacky.

I am using the following checker configuration, which now works fine (tested it):

        x-checker-data:
          type: html
          url: http://versions.teamspeak.com/ts3-client-2
          version-pattern: "\u0006stable\u0010.*3\\.(\\d+\\.\\d+)\u0012"
          encoding-error: ignore
          url-template: https://files.teamspeak-services.com/releases/client/3.$version/TeamSpeak3-Client-linux_amd64-3.$version.run
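
For illustration, this is roughly what ignoring decode errors amounts to in plain Python. It is a sketch, not the checker's actual code: the sample bytes are a short excerpt of the response (the "stable" entry plus the first byte of the next entry), and the variable names are made up for the example.

```python
import re

# Excerpt of the response body: the "stable" entry followed by the
# first byte (0x12) of the next entry.
data = bytes.fromhex(
    "0a06737461626c65"  # \n, \x06, "stable"
    "10ddffaaa806"      # \x10 and timestamp bytes (not valid UTF-8)
    "1a05332e362e32"    # \x1a, \x05, "3.6.2"
    "12"                # start of the next entry
)

# Same pattern as version-pattern above, with the YAML escapes resolved.
pattern = re.compile("\x06stable\x10.*3\\.(\\d+\\.\\d+)\x12")

# Strict decoding (the default) raises on the timestamp bytes.
try:
    data.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# errors="ignore" silently drops the undecodable bytes instead.
text = data.decode("utf-8", errors="ignore")
match = pattern.search(text)
version = match.group(1) if match else None

print(strict_error)
print(version)  # 6.2, which url-template expands to 3.6.2
```

Dropping bytes changes the text the pattern runs against, so a pattern written this way depends on which bytes happen to be invalid UTF-8.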

Bixilon commented 5 months ago

So, anything on this?

dbnicholson commented 2 months ago

I'm not the maintainer here, but I don't think you should try to force a binary download through htmlchecker. HTML is by definition a text format. I took a look at http://versions.teamspeak.com/ts3-client-2, and it's definitely not HTML. What an odd choice to encode that in a custom binary format instead of JSON or something.

As it turns out, I think this is protobuf format.

$ hd ts3-client-2 
00000000  08 05 12 16 0a 06 73 65  72 76 65 72 10 e5 b0 f0  |......server....|
00000010  f0 05 1a 06 33 2e 31 31  2e 30 12 1e 0a 0f 61 6c  |....3.11.0....al|
00000020  70 68 61 5f 6c 69 6e 75  78 5f 78 38 36 10 e6 c3  |pha_linux_x86...|
00000030  f9 fd 05 1a 05 33 2e 35  2e 36 12 1d 0a 0e 62 65  |.....3.5.6....be|
00000040  74 61 5f 6c 69 6e 75 78  5f 78 38 36 10 e6 c3 f9  |ta_linux_x86....|
00000050  fd 05 1a 05 33 2e 35 2e  36 12 1f 0a 10 73 74 61  |....3.5.6....sta|
00000060  62 6c 65 5f 6c 69 6e 75  78 5f 78 38 36 10 e6 c3  |ble_linux_x86...|
00000070  f9 fd 05 1a 05 33 2e 35  2e 36 12 13 0a 04 62 65  |.....3.5.6....be|
00000080  74 61 10 dd ff aa a8 06  1a 05 33 2e 36 2e 32 12  |ta........3.6.2.|
00000090  15 0a 06 73 74 61 62 6c  65 10 dd ff aa a8 06 1a  |...stable.......|
000000a0  05 33 2e 36 2e 32 12 14  0a 05 61 6c 70 68 61 10  |.3.6.2....alpha.|
000000b0  e9 f7 96 ab 06 1a 05 33  2e 36 2e 33 18 04        |.......3.6.3..|
000000be
$ ~/go/bin/protoscope ts3-client-2 
1: 5
2: {
  1: {
    14:SGROUP
    12: 4.5449766e30i32   # 0x72657672i32
  }
  2: 1578899557
  3: {"3.11.0"}
}
2: {
  1: {"alpha_linux_x86"}
  2: 1606312422
  3: {"3.5.6"}
}
2: {
  1: {"beta_linux_x86"}
  2: 1606312422
  3: {"3.5.6"}
}
2: {
  1: {"stable_linux_x86"}
  2: 1606312422
  3: {"3.5.6"}
}
2: {
  1: {"beta"}
  2: 1695203293
  3: {"3.6.2"}
}
2: {
  1: {"stable"}
  2: 1695203293
  3: {"3.6.2"}
}
2: {
  1: {"alpha"}
  2: 1701166057
  3: {"3.6.3"}
}
3: 4

It looks like each item is a tuple of name, time of update and version number. While you could probably get away with parsing it with a regex, it's certainly not robust. This seems like it needs to be a custom checker to be done correctly.
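
To make the tuple structure concrete, here is a minimal sketch of a protobuf wire-format reader in Python, run against the bytes from the hex dump above. The field meanings (1 = name, 2 = update time, 3 = version, with entries stored in top-level field 2) are inferred from the protoscope output, not from any published schema, and the function names are made up for this example. Only the two wire types that appear in the dump are handled.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_fields(buf):
    """Yield (field_number, value) for varint and length-delimited fields."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited bytes
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field, value


def parse_channels(buf):
    """Map name -> (update_time, version); assumes entries live in field 2."""
    channels = {}
    for field, value in read_fields(buf):
        if field != 2:
            continue
        entry = dict(read_fields(value))
        channels[entry[1].decode()] = (entry[2], entry[3].decode())
    return channels


# The 190 bytes from the hex dump, one string per dump row.
data = bytes.fromhex(
    "080512160a0673657276657210e5b0f0"
    "f0051a06332e31312e30121e0a0f616c"
    "7068615f6c696e75785f78383610e6c3"
    "f9fd051a05332e352e36121d0a0e6265"
    "74615f6c696e75785f78383610e6c3f9"
    "fd051a05332e352e36121f0a10737461"
    "626c655f6c696e75785f78383610e6c3"
    "f9fd051a05332e352e3612130a046265"
    "746110ddffaaa8061a05332e362e3212"
    "150a06737461626c6510ddffaaa8061a"
    "05332e362e3212140a05616c70686110"
    "e9f796ab061a05332e362e331804"
)

channels = parse_channels(data)
print(channels["stable"])  # (1695203293, '3.6.2')
print(channels["server"])  # (1578899557, '3.11.0')
```

Read this way, the first entry is named "server"; protoscope guessed at a nested message there, which is why it shows up as SGROUP in the output above.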

Alternatively, there could maybe be a type: raw checker that reads in binary data and then uses a binary regex before decoding the match back to a string.
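
A rough sketch of that idea in Python, assuming nothing about how such a checker would actually be wired up: the pattern is compiled from bytes, matched against the undecoded response, and only the captured group is decoded. The sample data is the stable and alpha entries from the hex dump above, and the pattern is one possible choice for this file, not a proposed default.

```python
import re

# The "stable" and "alpha" entries from the hex dump, left undecoded.
data = bytes.fromhex(
    "12150a06737461626c6510ddffaaa8061a05332e362e32"
    "12140a05616c70686110e9f796ab061a05332e362e33"
)

# Bytes pattern: the length-prefixed name "stable", the tag byte 0x10,
# up to 10 arbitrary bytes for the varint timestamp, the tag byte 0x1a,
# one length byte, then a dotted version number.
pattern = re.compile(rb"\x06stable\x10.{1,10}?\x1a.(\d+(?:\.\d+)+)", re.DOTALL)

match = pattern.search(data)
# Only the matched group is decoded, so the timestamp bytes never
# go through a text codec.
version = match.group(1).decode("ascii") if match else None

print(version)  # 3.6.2
```

Because no bytes are dropped before matching, the pattern describes the file as it really is, though it still relies on the layout staying the same.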

wjt commented 1 month ago

Agreed. I don't think the html checker is the right tool to use here.

Happily, it seems that TeamSpeak is covered by release-monitoring.org (https://release-monitoring.org/project/8714/), so you can use the anitya checker.

Bixilon commented 1 month ago

@dbnicholson That's an interesting call, I did not notice it (I have not worked with protobuf before).

@wjt Agreed, but maybe there is a future use for this, since there are broken webpages. But yes, I am abusing it for my use case.