library doesn't reencode files in the same way as they were originally encoded

ThaUnknown commented 1 year ago

Running a file through the decoder and then the encoder again doesn't always yield the same bytes: sometimes it does, sometimes it doesn't.

reproduction code:

import { createReadStream } from 'fs'
import { createHash } from 'crypto'
import { EbmlStreamDecoder, EbmlStreamEncoder } from 'ebml-stream'
import { pipeline as _p } from 'stream'
import { promisify } from 'util'

const pipeline = promisify(_p)

// `file` is the name of a file in media/, e.g. 'audiosample.webm'
// hash of the file after a decode -> encode round trip
const ebmlDecoder = new EbmlStreamDecoder()
const ebmlEncoder = new EbmlStreamEncoder()
const hasher = createHash('sha1').setEncoding('hex')
await pipeline(createReadStream('media/' + file), ebmlDecoder, ebmlEncoder, hasher)

const hash = hasher.read()
console.log(hash)

// vs

// hash of the untouched file
const hasher1 = createHash('sha1').setEncoding('hex')
await pipeline(createReadStream('media/' + file), hasher1)

const hash1 = hasher1.read()
console.log(hash1)

This yields different outputs, mainly with different lengths. I've traced the difference to master tags most of the time, specifically Segment and EBML.

Notably, from the media/ folder, audiosample.webm has a different length after reencoding, while video-webm-codecs-avc1-42E01E.webm comes out identical.

Is this caused by corrupted blocks?

ThaUnknown commented 1 year ago

@austinleroy I've created a simple test to find at which byte, tag and start position the reencoding fails:

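// assumes `filestream` is a read stream of the file under test and `data`
// is a Buffer holding the same file (e.g. from readFileSync)
let lastTag = null // most recent tag emitted by the decoder
let i = 0 // offset of the next output byte to compare against `data`
let failed = false // report only the first mismatch
await pipeline(
  filestream,
  new EbmlStreamDecoder(),
  new Transform({
    transform (tag, _, cb) {
      lastTag = tag
      cb(null, tag)
    },
    readableObjectMode: true,
    writableObjectMode: true
  }),
  new EbmlStreamEncoder(),
  new Transform({
    transform (chunk, _, cb) {
      const start = i
      for (const byte of chunk) {
        if (failed) break
        if (byte !== data[i]) {
          console.log('failed on byte ' + i, 'chunk startposition ' + start, lastTag, EbmlTagId[lastTag.id])
          failed = true
          filestream.destroy()
          break
        }
        ++i
      }
      cb(null, chunk)
    }
  }),
  createHash('sha1').setEncoding('hex')
)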

austinleroy commented 1 year ago

It looks like this is a result of this code. I think it was added because I ran into a WebM player that would break when Segment/Cluster lengths weren't written using a full 8-byte VINT. I don't think the resulting file is necessarily corrupt; it just isn't as small as it could be.
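
To make the size difference concrete, here is a minimal sketch of writing the same length once as a minimal-width VINT and once as a full 8-byte VINT. `encodeVint` and `minimalWidth` are hypothetical helpers written for this illustration, not functions from ebml-stream, and the value 1000 is an arbitrary example.

```javascript
// Encode `value` as an EBML variable-length integer (VINT) of `width` bytes.
// Hypothetical helper for illustration; not part of the library's API.
function encodeVint (value, width) {
  if (!Number.isInteger(width) || width < 1 || width > 8) {
    throw new RangeError('width must be an integer from 1 to 8')
  }
  const v = BigInt(value)
  // a VINT of `width` bytes carries 7 * width value bits; the all-ones
  // pattern is reserved for "unknown size", so it is excluded here
  const max = (1n << BigInt(7 * width)) - 2n
  if (v < 0n || v > max) {
    throw new RangeError('value does not fit in ' + width + ' byte(s)')
  }
  // set the length-marker bit just above the value bits
  let bits = v | (1n << BigInt(7 * width))
  const out = Buffer.alloc(width)
  for (let i = width - 1; i >= 0; i--) {
    out[i] = Number(bits & 0xffn)
    bits >>= 8n
  }
  return out
}

// Smallest width (in bytes) that can hold `value`.
function minimalWidth (value) {
  const v = BigInt(value)
  for (let width = 1; width <= 8; width++) {
    if (v <= (1n << BigInt(7 * width)) - 2n) return width
  }
  throw new RangeError('value too large for a VINT')
}

const size = 1000 // arbitrary example length
console.log(encodeVint(size, minimalWidth(size)).toString('hex')) // 43e8
console.log(encodeVint(size, 8).toString('hex')) // 01000000000003e8
```

Both outputs carry the value 1000, but the second is six bytes longer, which is enough to shift the position of every tag that follows.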

ThaUnknown commented 1 year ago

Maybe, I'd need to test. I wanted to do this to allow in-place editing of tags.

The general idea is that browsers don't support multiple tracks, so by changing the default video/audio track on the fly, in place, one could change which track is being played, since browsers [usually] play the default track. But I ran into this issue, because it all needs to be done in place with the same length so that transmuxing isn't needed.
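
A minimal sketch of such an in-place edit, assuming the byte offset of a track's FlagDefault element (Matroska ID 0x88) has already been located by some other means. `setFlagDefault` and `elementOffset` are hypothetical names, not part of ebml-stream, and the sketch only handles the layout where the size field is the single byte 0x81.

```javascript
const FLAG_DEFAULT_ID = 0x88 // Matroska FlagDefault element ID

// Overwrite the 1-byte payload of a FlagDefault element in place.
// `elementOffset` is the offset of the element's ID byte (assumed known).
function setFlagDefault (buf, elementOffset, enabled) {
  if (buf[elementOffset] !== FLAG_DEFAULT_ID) {
    throw new Error('not a FlagDefault element')
  }
  // only handle a 1-byte size field announcing a 1-byte payload
  if (buf[elementOffset + 1] !== 0x81) {
    throw new Error('unexpected size field')
  }
  buf[elementOffset + 2] = enabled ? 1 : 0
  return buf
}

// made-up fragment: TrackNumber (0xD7) = 1 followed by FlagDefault = 1
const track = Buffer.from([0xd7, 0x81, 0x01, 0x88, 0x81, 0x01])
setFlagDefault(track, 3, false)
console.log(track.toString('hex')) // d78101888100
```

Only the payload byte changes, so the length and every other tag position stay the same. Per the Matroska spec FlagDefault defaults to 1, so a muxer may omit the element entirely, in which case there is nothing to patch in place.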

Another issue that I didn't fully track down was that the default flags on tracks weren't being set [I THINK], but I never double-checked whether that was the case because of the issue described here.

ThaUnknown commented 1 year ago

would simply removing the if check fix this issue?

austinleroy commented 1 year ago

It probably won't hurt, but looking through the code I'm seeing some other potential problems. They may not be an issue if you aren't using Unknown-sized tags, though.

ThaUnknown commented 1 year ago

I'd like to get it working with any mkv file out there in the wild. I failed to fix this myself, which is why I originally opened this issue, hoping you'd be able to make a commit/release that fixes it.

ThaUnknown commented 1 year ago

@austinleroy the code you linked wasn't the problem, though removing it did produce different outputs. 1st hash is the original file, 2nd is after the change, 3rd is before the change

TLDR: something changed, but the files that were fine before are still fine, and the ones that were broken before are still broken.

austinleroy commented 1 year ago

I found some other potential causes of changing file content due to there being multiple different valid VINT representations of the same value (e.g. "0x4001" and "0x81" are VINTs with the value "1") and differences in writing int/float lengths. I pushed a couple of updates to a branch "PreventUnintendedChanges" in this repo. If you want to test with that it may be better for your use case. I don't think the hashes will match yet (there are still some discrepancies in parsed vs written float values). But the lengths should match, at least.
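
The VINT ambiguity mentioned above can be sketched with a small decoder. `decodeVint` is a hypothetical helper written for this illustration, not the library's parser; it only shows that the two byte sequences from the example decode to the same value with different widths.

```javascript
// Decode the VINT that starts at `offset` in `buf`.
// Hypothetical helper for illustration; not part of the library's API.
function decodeVint (buf, offset = 0) {
  const first = buf[offset]
  if (first === undefined || first === 0) {
    throw new RangeError('invalid VINT start')
  }
  // width = number of leading zero bits in the first byte, plus one
  const width = Math.clz32(first) - 24 + 1
  if (offset + width > buf.length) throw new RangeError('truncated VINT')
  // drop the marker bit, then append the remaining bytes
  let value = BigInt(first & (0xff >> width))
  for (let i = 1; i < width; i++) {
    value = (value << 8n) | BigInt(buf[offset + i])
  }
  return { value, width }
}

const twoByte = decodeVint(Buffer.from([0x40, 0x01]))
const oneByte = decodeVint(Buffer.from([0x81]))
console.log(`${twoByte.value} in ${twoByte.width} byte(s)`) // 1 in 2 byte(s)
console.log(`${oneByte.value} in ${oneByte.width} byte(s)`) // 1 in 1 byte(s)
```

An encoder that always picks the shortest form would turn the first input into the second on a round trip, so the output would be one byte shorter even though the decoded value is unchanged.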

ThaUnknown commented 1 year ago

Sorry, I think the meaning got lost: by "exact length" I meant "tags are in the same positions", because I'd prefer to do in-place manipulation rather than reencode the entire stream. But if this works, I'd be fine with it. //testing

ThaUnknown commented 1 year ago

oh god, I realised how flawed my understanding of streams was! I completely forgot about backpressure, so half of these tests won't work correctly, oopsie!

ThaUnknown commented 1 year ago

this is very very haphazard, but here's a test implementation:

import { createReadStream, readFileSync } from 'fs'
import { EbmlStreamDecoder, EbmlStreamEncoder } from '../lib/index.js'
import { pipeline as _p } from 'stream'
import { promisify } from 'util'
import { createHash } from 'crypto'

const pipeline = promisify(_p)

const files = ['video1.webm', 'video2.webm', 'video3.webm', 'video4.webm', 'test5.mkv']

for (const file of files) {
  console.log(file)
  const hasher1 = createHash('sha1').setEncoding('hex')
  const stream1 = createReadStream('media/' + file)
  await pipeline(stream1, hasher1)
  console.log(hasher1.read())

  const hasher2 = createHash('sha1').setEncoding('hex')
  const stream2 = createReadStream('media/' + file)
  await pipeline(
    stream2,
    new EbmlStreamDecoder(),
    new EbmlStreamEncoder(),
    hasher2
  )
  console.log(hasher2.read())

  const filestream = createReadStream('media/' + file)
  const pip = _p(
    [
      filestream,
      new EbmlStreamDecoder(),
      new EbmlStreamEncoder()
    ], () => { }
  )
  const res = await stream2buffer(pip)
  const data = readFileSync('media/' + file)
  console.log(res.length === data.length)
}

function stream2buffer (stream) {
  return new Promise((resolve, reject) => {
    const _buf = []

    stream.on('data', (chunk) => _buf.push(chunk))
    stream.on('end', () => resolve(Buffer.concat(_buf)))
    stream.on('error', (err) => reject(err))
  })
}

drop this in ./test/file.js then run it with node ./test/file.js

This seems to work fine for webm files but fails for mkv files:

ThaUnknown commented 1 year ago

I also have an expanded list of EbmlTags that should include most mkv tags: ~~EbmlTagId.js~~ (I force-pushed and overwrote all these tags, I want to cry, all my work is gone).

Here's the test video I used for mkv: https://files.catbox.moe/zznt0e.mkv

The rest of the videos are just your webms renamed to video1, video2, etc.

I have very little time to work on this, so I hope this is enough for you to reproduce it. I am, however, immensely interested in this issue, because it has the potential to solve one of the longest-standing issues in web video: the lack of multi-track audio and video support. Currently, to work around this, videos are muxed on the server, which is intensive for the servers. If this could instead be done on the client in JS with this library, it would scale much better for such services.

I hope you don't take this as "he just wants to throw work at me". I am thankful you even looked at this <3

ThaUnknown commented 1 year ago

I created a mapping of all EBML tags according to the spec, then overwrote all the work I did a few days back. I want to punch something. Sorry, that js/txt file I sent isn't actually helpful.

ThaUnknown commented 1 year ago

ebmltags.json: here's the actual list of all Matroska tags in a JSON file. Note that the id is a hex value in string form, so you'll need some parsing. These are taken from the official Matroska documentation.

Do note that these overlap with the WebM ones, so they need to be de-duped, which is what I had done before and is the progress I lost.
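
A rough sketch of the parsing and de-duplication described above. The entry shape (`{ name, id }` with `id` as a hex string) is an assumption, since the actual layout of ebmltags.json is not shown here, and `parseTagList` is a hypothetical helper name.

```javascript
// Assumed entry shape: { name: 'EBML', id: '0x1A45DFA3' }; the real file may differ.
// Returns a Map from numeric tag id to tag name, keeping the first name seen.
function parseTagList (entries) {
  const byId = new Map()
  for (const { name, id } of entries) {
    const numericId = parseInt(id, 16) // accepts an optional "0x" prefix
    if (Number.isNaN(numericId)) throw new TypeError('bad id for ' + name)
    if (!byId.has(numericId)) byId.set(numericId, name) // first entry wins
  }
  return byId
}

// illustrative input: tags already known from WebM, then the Matroska list
const webmTags = [{ name: 'Segment', id: '0x18538067' }]
const matroskaTags = [
  { name: 'EBML', id: '0x1A45DFA3' },
  { name: 'Segment', id: '18538067' } // duplicate of the WebM entry
]
const merged = parseTagList([...webmTags, ...matroskaTags])
console.log(merged.size) // 2
```

Keying the Map on the parsed number rather than the string means '0x18538067' and '18538067' collapse into one entry.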

austinleroy / node-ebml

library doesn't reencode files in the same way as they were originally encoded #4