IRT-Open-Source / scf

Subtitling Conversion Framework
Apache License 2.0

Support for embedding STL files #39

Open braincoded opened 7 years ago

braincoded commented 7 years ago

It would be very useful if SCF supported embedding the source STL file (Base64-encoded) when converting from STL to EBU-TT, and extracting embedded files again.

http://bbc.github.io/subtitle-guidelines/#Embedded-STL
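A minimal sketch of the Base64 step this request relies on, using only the Python standard library. The function names `encode_stl` and `decode_stl` are hypothetical and not part of SCF; the sample bytes are a stand-in, not a valid STL file.

```python
import base64


def encode_stl(stl_bytes: bytes) -> str:
    """Return the Base64 text form of a binary STL file."""
    return base64.b64encode(stl_bytes).decode("ascii")


def decode_stl(b64_text: str) -> bytes:
    """Recover the binary STL from Base64 text.

    Whitespace is stripped first, because the payload may have been
    wrapped or indented when it was stored inside an XML document.
    """
    return base64.b64decode("".join(b64_text.split()))


# Stand-in bytes only; a real file would be read with open(path, "rb").
sample = b"850STL25.01" + bytes(range(16))
encoded = encode_stl(sample)
assert decode_stl(encoded) == sample
```

Stripping whitespace before decoding is a defensive choice, since XML pretty-printers may reflow the text content.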

spoeschel commented 6 years ago

Sorry; closed by mistake (internal issue ID)

ianbarrett commented 4 years ago

Hi,

At Channel4 we have our entire history of subtitles in EBU-STL format. Our new media services provider (who will maintain our repository of subtitles) will store subtitles in EBU-TT format and expects new subtitles to be delivered in that format. We need the source STL file embedded within the EBU-TT file so that it can be extracted easily and delivered to our playout provider, who is still working in EBU-STL.

The tools that we use for authoring subtitles don't currently support this format, so we plan to keep exporting STL files and then convert them (ideally using SCF) for delivery to our media services provider with the STL embedded, so that they have a consistent way of receiving and processing subtitle files.

Thanks,

Ian

andreastai commented 4 years ago

@ianbarrett That makes sense, and thanks for your request. It also matches the request from @braincoded.

Tunneling of binary data is now defined in EBU Tech 3390 (EBU-TT Metadata); it was previously defined in EBU Tech 3350 (EBU-TT Part 1). @ianbarrett and @braincoded, please have a look at Tech 3390 Chapter 3.5 and check whether it matches your requirements.

We should also follow Tech 3360, the guideline for converting EBU STL to EBU-TT Part 1; see Chapter 2.3 of Tech 3360. One implementation question is where to place the tunneled data. It may be good to follow the recommendation of Tech 3360 and place it at the end.

andreastai commented 4 years ago

For the implementation this might be an option when converting STL to STLXML with the stl2stlxml Python Script. From there it can be added to the EBU-TT Part 1 file.

Another interesting question is about roundtripping. SCF provides a way to convert Part 1 (i.e. STLXML) back to STL. If a Part 1 document has an embedded STL the question is if the user needs an option so that he can decide if he wants to convert the EBU-TT file or the embedded STL file back to the binary STL file.

ianbarrett commented 4 years ago

@tairt I'd be very happy to follow the guidance in the Tech 3360 and Tech 3390.

I think putting it at the end is the best option here. We should also try and follow the other rules in these specifications

eg: Removing the following fields : ebuttm:stlCreationDate ebuttm:stlRevisionDate ebuttm:stlRevisionNumber

and instead populating the attributes of the ebuttm:binaryData field.
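A rough sketch of how those three values could be read from the STL file's GSI block and turned into attributes for ebuttm:binaryData. Everything here is an assumption to check against the specifications: the byte offsets (CD at 224, RD at 230, RN at 236) are taken from my reading of EBU Tech 3264, the attribute names from Tech 3390, and the century `pivot` for the two-digit year is a hypothetical choice. `binary_data_attributes` is not an SCF function.

```python
from datetime import date

GSI_SIZE = 1024  # assumed length of the GSI block (EBU Tech 3264)


def _gsi_date(raw: bytes, pivot: int = 70) -> str:
    """Convert a six-digit YYMMDD GSI date to ISO YYYY-MM-DD.

    The century is not stored in the file; years >= pivot are treated
    as 19xx and the rest as 20xx (an assumption, adjust as needed).
    """
    text = raw.decode("ascii")
    yy, mm, dd = int(text[0:2]), int(text[2:4]), int(text[4:6])
    century = 1900 if yy >= pivot else 2000
    return date(century + yy, mm, dd).isoformat()


def binary_data_attributes(stl_bytes: bytes, file_name: str) -> dict:
    """Build an attribute dictionary for an ebuttm:binaryData element."""
    gsi = stl_bytes[:GSI_SIZE]
    if len(gsi) < GSI_SIZE:
        raise ValueError("input is shorter than one GSI block")
    return {
        "textEncoding": "BASE64",
        "binaryDataType": "EBU Tech 3264",
        "fileName": file_name,
        "creationDate": _gsi_date(gsi[224:230]),   # GSI field CD
        "revisionDate": _gsi_date(gsi[230:236]),   # GSI field RD
        "revisionNumber": str(int(gsi[236:238].decode("ascii"))),  # GSI field RN
    }
```

Invalid or blank date fields raise `ValueError` here; a real converter would need to decide whether to omit the attribute instead.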

Ideally we would also include the relevant fields/section in the head/metadata: ebuttm:documentOriginatingSystem ebuttm:conformsToStandard ebuttm:appliedProcessing ebuttm:stlConversion/* ebuttm:subtitleZero

In terms of the roundtripping question, I think having the option about which you would like to use is a good idea. I see in the spec, that the expected behaviour over time is that if the XML is updated, so the STL is no longer valid, the binaryData should be removed.... but it may be an interesting way to track changes since conversion if you were to restore both the XML, and the embedded STL, and compare them. It's a bit academic, but could be useful.
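The comparison idea could be sketched like this, assuming both STL versions are already available as bytes. It only reports which 128-byte TTI blocks differ after the 1024-byte GSI block; those sizes are my understanding of EBU Tech 3264, and `changed_tti_blocks` is a hypothetical helper, not part of SCF.

```python
from typing import List

GSI_SIZE = 1024  # General Subtitle Information block (assumed size)
TTI_SIZE = 128   # one Text and Timing Information block (assumed size)


def changed_tti_blocks(embedded: bytes, regenerated: bytes) -> List[int]:
    """Return the indices of TTI blocks that differ between two STL files.

    Blocks present in only one of the two files are reported as changed.
    """
    a, b = embedded[GSI_SIZE:], regenerated[GSI_SIZE:]
    block_count = -(-max(len(a), len(b)) // TTI_SIZE)  # ceiling division
    return [
        i
        for i in range(block_count)
        if a[i * TTI_SIZE:(i + 1) * TTI_SIZE] != b[i * TTI_SIZE:(i + 1) * TTI_SIZE]
    ]
```

A byte-level comparison is deliberately crude: a roundtrip through EBU-TT may change bytes without changing the visible subtitle, so the result is only a starting point for review.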

andreastai commented 4 years ago

@ianbarrett Thanks!

> I think putting it at the end is the best option here. We should also try and follow the other rules in these specifications
>
> eg: Removing the following fields : ebuttm:stlCreationDate ebuttm:stlRevisionDate ebuttm:stlRevisionNumber

True...I haven't thought of that...

> Ideally we would also include the relevant fields/section in the head/metadata: ebuttm:documentOriginatingSystem ebuttm:conformsToStandard ebuttm:appliedProcessing ebuttm:stlConversion/* ebuttm:subtitleZero

At the moment there is no option to populate EBU-TT files with other data than that is in the STL file itself (@spoeschel correct me if I am wrong). You may have to add an additional step to include this data...

> In terms of the roundtripping question, I think having the option about which you would like to use is a good idea. I see in the spec, that the expected behaviour over time is that if the XML is updated, so the STL is no longer valid, the binaryData should be removed.... but it may be an interesting way to track changes since conversion if you were to restore both the XML, and the embedded STL, and compare them. It's a bit academic, but could be useful.

Yes, I can see the use case for it...

spoeschel commented 3 years ago

> At the moment there is no option to populate EBU-TT files with other data than that is in the STL file itself (@spoeschel correct me if I am wrong). You may have to add an additional step to include this data...

This is correct. We just provide parameters to modify the handling of timecodes (e.g. to apply a specific offset).

> For the implementation this might be an option when converting STL to STLXML with the stl2stlxml Python Script. From there it can be added to the EBU-TT Part 1 file.

This could be done as follows:

> Another interesting question is about roundtripping. SCF provides a way to convert Part 1 (i.e. STLXML) back to STL. If a Part 1 document has an embedded STL the question is if the user needs an option so that he can decide if he wants to convert the EBU-TT file or the embedded STL file back to the binary STL file.

Such an option would then have to be added to `EBU-TT2STLXML`. However, it would change the output format from STLXML to plain STL. So I think a simple helper XSLT (or XQuery?) module could handle this task instead, as it would simply extract and decode the binary payload.
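To illustrate how small such an extraction helper could be, here is a sketch in Python rather than XSLT or XQuery. It assumes the payload sits in an ebuttm:binaryData element in the `urn:ebu:tt:metadata` namespace with `textEncoding` and `fileName` attributes; `extract_embedded_stl` is a hypothetical name and not an SCF module.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

EBUTTM_NS = "urn:ebu:tt:metadata"


def extract_embedded_stl(ebutt_xml: str) -> Optional[Tuple[Optional[str], bytes]]:
    """Return (file_name, stl_bytes) for the first ebuttm:binaryData element.

    Returns None if the document carries no embedded binary data.
    Only Base64 payloads are handled in this sketch.
    """
    root = ET.fromstring(ebutt_xml)
    node = root.find(f".//{{{EBUTTM_NS}}}binaryData")
    if node is None:
        return None
    if node.get("textEncoding", "BASE64").upper() != "BASE64":
        raise ValueError("only BASE64 payloads are supported here")
    # Remove line breaks and indentation before decoding.
    payload = "".join((node.text or "").split())
    return node.get("fileName"), base64.b64decode(payload)
```

The caller would then write the returned bytes to a file, using the returned name if one is present.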

@braincoded / @ianbarrett - would all this work for you?

andreastai commented 3 years ago

> if present) copy the source STL file to a ebuttm:binaryData element, storing it below the last tt:div of the result

Note that the recommended practice is to add the <ebuttm:binaryData> element inside a <tt:metadata> element that is a child of the last <div> element. This <div> element has no <p> elements as children.

See also Tech 3360 page 28 and Tech 3390 page 21.
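A sketch of that placement with Python's ElementTree: a new, final tt:div is appended to tt:body, containing only a tt:metadata element that wraps ebuttm:binaryData. The namespace URIs are the TTML and EBU-TT metadata namespaces; the attribute set shown is an assumption to verify against Tech 3390, and `append_embedded_stl` is a hypothetical helper, not the SCF implementation.

```python
import base64
import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"
EBUTTM_NS = "urn:ebu:tt:metadata"
ET.register_namespace("tt", TT_NS)
ET.register_namespace("ebuttm", EBUTTM_NS)


def append_embedded_stl(root: ET.Element, stl_bytes: bytes, file_name: str) -> ET.Element:
    """Append a trailing tt:div that holds only the embedded STL."""
    body = root.find(f"{{{TT_NS}}}body")
    if body is None:
        raise ValueError("document has no tt:body")
    # The new div becomes the last div and has no tt:p children.
    div = ET.SubElement(body, f"{{{TT_NS}}}div")
    metadata = ET.SubElement(div, f"{{{TT_NS}}}metadata")
    data = ET.SubElement(
        metadata,
        f"{{{EBUTTM_NS}}}binaryData",
        {
            "textEncoding": "BASE64",
            "binaryDataType": "EBU Tech 3264",
            "fileName": file_name,
        },
    )
    data.text = base64.b64encode(stl_bytes).decode("ascii")
    return data
```

Creating a dedicated div, rather than reusing the last subtitle-bearing one, keeps the embedded data separate from the div elements that carry tt:p content.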

spoeschel commented 3 years ago
  • extend our STLXML format
    • contain the source STL file (encoded as Base64; e.g. below an optional element /StlXml/StlSource)
    • this also includes the source STL filename as /StlXml/StlSource/@filename, as Part 2 requires it (in Part M, it is optional)

To allow for possible future extensions (none that we can currently think of), it would be better to use separate elements below /StlXml/StlSource instead, storing the filename in ./Filename and the actual data in ./Data.
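The proposed layout could look as follows when built with Python's ElementTree. This is only an illustration of the /StlXml/StlSource/Filename and /StlXml/StlSource/Data structure; it assumes the elements carry no namespace and that StlSource is appended as the last child of StlXml, neither of which is confirmed here. `add_stl_source` is a hypothetical helper.

```python
import base64
import xml.etree.ElementTree as ET


def add_stl_source(stlxml_root: ET.Element, stl_bytes: bytes, file_name: str) -> ET.Element:
    """Attach an optional StlSource element with Filename and Data children."""
    source = ET.SubElement(stlxml_root, "StlSource")
    ET.SubElement(source, "Filename").text = file_name
    ET.SubElement(source, "Data").text = base64.b64encode(stl_bytes).decode("ascii")
    return source


# Usage with a bare root element and stand-in bytes.
root = ET.Element("StlXml")
add_stl_source(root, b"\x01\x02\x03", "example.stl")
print(ET.tostring(root, encoding="unicode"))
```

Keeping the filename and data in separate child elements leaves room to add further children later without changing how existing ones are read.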

spoeschel commented 3 years ago
  • STLXML-SplitBlocks: keep a present StlSource element unmodified

This is indeed already the case; in general all content is copied unmodified by this module.

spoeschel commented 3 years ago

We just released version 1.9.0 that supports optional storage/tunnelling of the STL source file via new parameters, added to the modules STL2STLXML and STLXML2EBU-TT. Note that the overall conversion from EBU STL to EBU-TT is done according to EBU Tech 3360 v0.9, as well as the newly added STL source file processing (which is slightly different compared to v1.0).