defgsus / teletext-archive-2023

daily commits of german teletext pages

Cool project #1

Open Casandro opened 9 months ago

Casandro commented 9 months ago

Just to give you a heads-up, I'm also archiving teletext pages, but I'm doing it from the DVB signal. I typically upload them to archive.org, but I can set up an rsync share if you want. https://archive.org/details/teletext-2023-01

The software is here: https://github.com/Casandro/teletext_ng

defgsus commented 9 months ago

Hi @Casandro, this is very cool.

I did not know about the upload-to-archive.org way. I also wonder how long GitHub will tolerate archive repos like this one. Currently, I create a new one every year to keep them somewhat limited in size.

That said, I don't think I should add data to this archive that is already archived somewhere else. But I much appreciate your links! A lot of channels are not easily accessible through their "teletext" websites.

defgsus commented 9 months ago

Actually, it may be possible to browse the content of your zip files online. I started this viewer a while ago: https://github.com/defgsus/teletext-viewer

It should be possible to fetch the zip files and extract them, all in the browser's javascript engine. It's just a bit of traffic if each file includes a whole year.

Casandro commented 9 months ago

Cool, I'll have to look into this. Obviously I could just host raw t42 files myself or just extract them with a CGI-script or something.

defgsus commented 9 months ago

Hey, I'm actually a bit stuck with the t42 format. I'm reading https://www.etsi.org/deliver/etsi_i_ets/300700_300799/300706/01_60/ets_300706e01p.pdf but it's super complex.

I figured that most bytes & 0x7f represent characters (the 8th bit is probably the hamming bit?), although the characters do not seem to fit any of the G0 to G3 mappings.

Are there colors encoded? What are the first 2 bytes in each row? Please point me to some simple documentation, if possible.

Casandro commented 9 months ago

Well, last year I gave some talks about the basics:

https://media.ccc.de/v/fire-shonks-2022-49077-die-technik-hinter-teletext
https://media.ccc.de/v/retronetcall-20230705-casandro-teletext
https://www.youtube.com/watch?v=ITQkgM9AihE

Essentially, those are the packets as they are transmitted to the decoder, including the redundant information. The first 2 bytes encode the packet address. They use a Hamming code that turns 4 bits into 8 bits, so those 2 bytes carry 1 byte of information, the "packet address". It determines how the rest of the packet should be interpreted. 3 bits of it select the "magazine", which is the first digit of the page number, while the other 5 bits give the row address.

Row 0 is a header row, where the first octets encode the subpage number and flags. Rows 1-24 are "directly displayable rows", which contain 7-bit character codes with an added 8th bit for parity. There are further rows that allow individual characters to be overwritten or other colours to be used. That is how, for example, the "Impressum" pages on German-language services manage to have both the @ symbol and umlauts on the same page. A decoder has to be at least Level 1.5 to decode that.
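To make the packet address part concrete, here is a minimal Python sketch. It assumes that a .t42 file is just a sequence of 42-byte packets (2 address bytes plus 40 payload bytes) and that the four data bits of each Hamming-coded byte sit at bit positions 1, 3, 5 and 7, which is how I read ETS 300 706. It does no error correction, and the function names are made up for this example.

```python
PACKET_SIZE = 42  # assumed: 2 address bytes + 40 payload bytes per packet


def hamming84_data(byte: int) -> int:
    """Extract the 4 data bits of a Hamming 8/4 byte (no error correction)."""
    return (
        ((byte >> 1) & 1)
        | (((byte >> 3) & 1) << 1)
        | (((byte >> 5) & 1) << 2)
        | (((byte >> 7) & 1) << 3)
    )


def packet_address(packet: bytes) -> tuple[int, int]:
    """Return (magazine, row) decoded from the first 2 bytes of a packet."""
    address = hamming84_data(packet[0]) | (hamming84_data(packet[1]) << 4)
    magazine = (address & 0x07) or 8  # magazine bits 000 mean magazine 8
    row = address >> 3                # remaining 5 bits: row 0..31
    return magazine, row


def iter_packets(data: bytes):
    """Split raw file content into complete 42-byte packets."""
    for offset in range(0, len(data) - PACKET_SIZE + 1, PACKET_SIZE):
        yield data[offset:offset + PACKET_SIZE]
```

With this, `packet_address(packet)` on each packet from `iter_packets(...)` tells you whether you are looking at a header (row 0), a displayable row (1-24) or one of the enhancement rows.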

Usually colours are encoded in a fairly simple way. As Teletext always transmits complete rows, there is no use for standard ASCII control codes like CR or LF. Therefore the control code area is used for attributes. For example there are 8 codes that switch the text colour to one of the 8 basic colours. Those control codes are displayed as spaces. There are also codes that switch to "mosaic"-mode or switch the background colour to the current foreground colour.
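A rough Python sketch of that idea for a single displayable row: strip the parity bit, treat everything below 0x20 as an attribute that is shown as a space, and let codes 0x00-0x07 switch the text colour. This is only an illustration under simplifying assumptions: it ignores mosaic mode, background colours and the national character subsets (so umlauts will come out as ASCII brackets and the like), and `decode_row` is a name invented here.

```python
# The 8 basic colours, indexed by control codes 0x00..0x07.
ALPHA_COLOURS = ["black", "red", "green", "yellow",
                 "blue", "magenta", "cyan", "white"]


def decode_row(payload: bytes) -> list[tuple[str, str]]:
    """Turn the 40 payload bytes of a row into (character, colour) cells."""
    colour = "white"  # every row starts with white text
    cells = []
    for byte in payload:
        code = byte & 0x7F  # drop the parity bit
        if code < 0x20:
            # Control codes occupy a cell and are displayed as a space.
            cells.append((" ", colour))
            if code <= 0x07:
                colour = ALPHA_COLOURS[code]  # applies to the following cells
        else:
            cells.append((chr(code), colour))
    return cells
```

For plain text extraction, `"".join(ch for ch, _ in decode_row(payload))` is enough; the colour part is only needed for a viewer.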

Casandro commented 9 months ago

BTW if you have questions, just ask. :)

defgsus commented 9 months ago

Thank you! The week has just started and i have to work on more recent technologies. web2.0 instead of teletext :rofl: But i will come back to it.

defgsus commented 9 months ago

Okay, there are a couple of questions. Meanwhile i'm reading https://zxnet.co.uk/teletext/teletext-resources/renderer.js and i think i need to decide what i actually want to do. The zxnet people built a pretty complete decoder with all the subtle details. For my statistical interests, extracting text with correct ümlauts is already enough. Colors are just nice-to-have.

On the other hand, I like the idea of a teletext viewer that can browse through history and looks good (in an 80s way). I won't spend time on it right now, but winter is just starting ;) I'm sure we can use the zxnet code for a web-based viewer. The one I linked to above was a proof of concept of a static GitHub webpage that accesses all the chronologically committed data from a repository. The frontend code uses React and has some terrible state handling. For the next version I want to try SvelteJS and make the whole page more user-friendly.

Access to individual zip files per channel and timestamp would be great then, plus some kind of index.

Kind regards for now

Casandro commented 9 months ago

Well, the current format is optimized for storage. That's why the innermost zip files are not compressed, but the next level up is. By the way, I don't think it's still worth putting time into this "web thing". It's already pretty old and broken, and there are only 3 implementations of the required "browser" left.
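For reading such nested archives without unpacking anything to disk, something like the following Python sketch should work. The layout is an assumption (an outer, compressed zip containing uncompressed inner zips), and the file names in `build_example` are placeholders, not the real names used in the uploads.

```python
import io
import zipfile


def iter_inner_files(outer_bytes: bytes):
    """Yield (outer_name, inner_name, content) for files in nested zips."""
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer:
        for outer_name in outer.namelist():
            if not outer_name.lower().endswith(".zip"):
                continue
            # Open the inner zip straight from memory.
            with zipfile.ZipFile(io.BytesIO(outer.read(outer_name))) as inner:
                for inner_name in inner.namelist():
                    yield outer_name, inner_name, inner.read(inner_name)


def build_example() -> bytes:
    """Build a tiny nested archive in memory (hypothetical file names)."""
    inner_buf = io.BytesIO()
    with zipfile.ZipFile(inner_buf, "w", zipfile.ZIP_STORED) as inner:
        inner.writestr("100.t42", bytes(42))  # one dummy 42-byte packet
    outer_buf = io.BytesIO()
    with zipfile.ZipFile(outer_buf, "w", zipfile.ZIP_DEFLATED) as outer:
        outer.writestr("channel-timestamp.zip", inner_buf.getvalue())
    return outer_buf.getvalue()
```

Storing the inner zips uncompressed lets the outer compression work across all pages at once, which is presumably why this layout saves space.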

Casandro commented 9 months ago

By the way, the teletext_ng project also contains scripts that can extract the T.42 data as text (with and without ANSI colours).

defgsus commented 9 months ago

You probably mean dump_tta_text_colour.c. Yes, that looks quite useful. I'll play around with it.

Your talk at Vintage Computer F. is very entertaining, by the way, for people like us.. The CCC ones too, but the way the protagonist gets "rotated in" next to the slides is so wonderfully nerdy :rofl:

Casandro commented 9 months ago

Yes, I got myself a blue wall so it doesn't look so static. :)