Open ThaUnknown opened 1 year ago
@austinleroy I've created a simple test to find at which byte, tag, and start position the re-encoding fails:
// `filestream` is a read stream of the file under test;
// `data` is the same file read fully into a Buffer.
let lastTag = null
let i = 0
await pipeline(
  filestream,
  new EbmlStreamDecoder(),
  // Remember the most recently decoded tag so a mismatch can be attributed to it.
  new Transform({
    transform (tag, _, cb) {
      lastTag = tag
      cb(null, tag)
    },
    readableObjectMode: true,
    writableObjectMode: true
  }),
  new EbmlStreamEncoder(),
  // Compare the re-encoded bytes against the original file, byte by byte.
  new Transform({
    transform (chunk, _, cb) {
      const start = i
      for (const byte of chunk) {
        if (byte !== data[i]) {
          console.log('failed on byte ' + i, 'chunk startposition ' + start, lastTag, EbmlTagId[lastTag.id])
          filestream.destroy()
          break
        }
        ++i
      }
      cb(null, chunk)
    }
  }),
  createHash('sha1').setEncoding('hex')
)
It looks like this is a result of this code. I think this was added because I ran into a webmplayer that would break when Segment/Cluster lengths weren't written using a full 8-byte VINT. I don't think that the resulting file is necessarily corrupt, it just isn't as small as it could be.
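To illustrate why a full 8-byte length makes the output larger without making it invalid, here is a minimal sketch of writing an EBML data size at a chosen width. `encodeVintSize` and `toHex` are hypothetical helpers written for this comment, not functions from this library.

```javascript
// Hypothetical helper (not part of this library): encode `value` as an EBML
// data-size VINT that occupies exactly `width` bytes (1 to 8).
function encodeVintSize (value, width) {
  if (!Number.isInteger(width) || width < 1 || width > 8) {
    throw new RangeError('width must be an integer from 1 to 8')
  }
  const v = BigInt(value)
  // A VINT of `width` bytes carries 7 * width payload bits; the all-ones
  // payload is reserved for "unknown size", so it is excluded here.
  const max = (1n << BigInt(7 * width)) - 2n
  if (v < 0n || v > max) {
    throw new RangeError('value does not fit in ' + width + ' byte(s)')
  }
  // The length marker is a single 1 bit preceded by (width - 1) zero bits.
  let bits = v | (1n << BigInt(7 * width))
  const out = new Uint8Array(width)
  for (let i = width - 1; i >= 0; i--) {
    out[i] = Number(bits & 0xffn)
    bits >>= 8n
  }
  return out
}

const toHex = (bytes) => Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('')

console.log(toHex(encodeVintSize(1, 1))) // 81
console.log(toHex(encodeVintSize(1, 8))) // 0100000000000001
```

Both outputs describe a size of 1, so a re-encoded file can grow by up to 7 bytes per rewritten length while still being readable.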
maybe, I'd need to test. I wanted to do this to allow in-place editing of tags
The general idea is that browsers don't support multiple tracks, so by changing the default video/audio track on the fly, in place, one could change which track is being played, as browsers play the default track [usually]. But I ran into this issue, because it all needs to be done in place with the same length so that transmuxing isn't needed
another issue that I didn't fully track down was that the default flags on tracks weren't being set [I THINK], but I never double-checked whether that was the case because of the issue described in this PR
would simply removing the if check fix this issue?
It probably won't hurt, but looking through the code I'm seeing some other potential problems. They may not be an issue if you aren't using Unknown-sized tags, though.
I'd like to get it working with any mkv file out there in the wild, I failed to fix this myself which is why I originally made this issue, hoping you'd be able to make a commit/release that fixes this
@austinleroy the code you linked wasn't the problem, though removing it did produce different output. The 1st hash is the original file, the 2nd is after the change, and the 3rd is before the change
TLDR something changed, but the files that were fine before are still fine, and the ones that were broken before are still broken
I found some other potential causes of changing file content due to there being multiple different valid VINT representations of the same value (e.g. "0x4001" and "0x81" are VINTs with the value "1") and differences in writing int/float lengths. I pushed a couple of updates to a branch "PreventUnintendedChanges" in this repo. If you want to test with that it may be better for your use case. I don't think the hashes will match yet (there are still some discrepancies in parsed vs written float values). But the lengths should match, at least.
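As a minimal sketch of that ambiguity, the decoder below reads a VINT and reports both its value and its width. `decodeVint` is a hypothetical helper for illustration, not this library's API.

```javascript
// Hypothetical helper (not part of this library): decode the VINT that starts
// at `offset` and return its numeric value plus the number of bytes it used.
function decodeVint (bytes, offset = 0) {
  const first = bytes[offset]
  if (first === undefined) throw new RangeError('offset out of range')
  if (first === 0) throw new RangeError('VINTs longer than 8 bytes are not supported')
  // Width is the number of leading zero bits in the first byte, plus one.
  const width = Math.clz32(first) - 24 + 1
  if (offset + width > bytes.length) throw new RangeError('truncated VINT')
  // Drop the marker bit from the first byte, then append the remaining bytes.
  let value = BigInt(first & (0xff >> width))
  for (let i = 1; i < width; i++) {
    value = (value << 8n) | BigInt(bytes[offset + i])
  }
  return { value, width }
}

console.log(decodeVint(Uint8Array.of(0x81)))       // value 1n, width 1
console.log(decodeVint(Uint8Array.of(0x40, 0x01))) // value 1n, width 2
```

A decoder that keeps only `value` loses `width`, so an encoder that always picks the shortest form cannot reproduce the original bytes. Keeping the width alongside the value is one way to make a round trip preserve positions.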
sorry, I think the meaning got lost: by "exact length" I meant "tags are in the same positions", because I'd prefer to do in-place manipulations rather than re-encode the entire stream, but if this works, then I'd be fine with it //testing
oh god, I realised how flawed my understanding of streams was! I completely forgot about backpressure, so half of these tests won't work correctly, oops!
this is very very haphazard, but here's a test implementation:
import { createReadStream, readFileSync } from 'fs'
import { EbmlStreamDecoder, EbmlStreamEncoder } from '../lib/index.js'
import { pipeline as _p } from 'stream'
import { promisify } from 'util'
import { createHash } from 'crypto'

const pipeline = promisify(_p)
const files = ['video1.webm', 'video2.webm', 'video3.webm', 'video4.webm', 'test5.mkv']

for (const file of files) {
  console.log(file)

  // Hash of the original file.
  const hasher1 = createHash('sha1').setEncoding('hex')
  const stream1 = createReadStream('media/' + file)
  await pipeline(stream1, hasher1)
  console.log(hasher1.read())

  // Hash of the file after a decode/encode round trip.
  const hasher2 = createHash('sha1').setEncoding('hex')
  const stream2 = createReadStream('media/' + file)
  await pipeline(
    stream2,
    new EbmlStreamDecoder(),
    new EbmlStreamEncoder(),
    hasher2
  )
  console.log(hasher2.read())

  // Compare the length of the round-tripped output with the original.
  const filestream = createReadStream('media/' + file)
  const pip = _p(
    [
      filestream,
      new EbmlStreamDecoder(),
      new EbmlStreamEncoder()
    ], () => { }
  )
  const res = await stream2buffer(pip)
  const data = readFileSync('media/' + file)
  console.log(res.length === data.length)
}

// Collect a readable stream into a single Buffer.
function stream2buffer (stream) {
  return new Promise((resolve, reject) => {
    const _buf = []
    stream.on('data', (chunk) => _buf.push(chunk))
    stream.on('end', () => resolve(Buffer.concat(_buf)))
    stream.on('error', (err) => reject(err))
  })
}
drop this in ./test/file.js
then run it with node ./test/file.js
this seems to work fine for webm's but fails for mkv's:
I also have an expanded list of EbmlTags that should include most mkv tags
EbmlTagId.js I force pushed and overwrote all these tags, I want to cry, all my work
here's the test video i used for mkv: https://files.catbox.moe/zznt0e.mkv
rest of the videos are just your webms renamed to video1 ...etc
I have very, very little time to work on this, so I hope this is enough for you to get reproduction steps. I am, however, immensely interested in this issue, because it has the potential to solve one of the longest-standing issues in web video: the lack of multi-track audio and video support. Currently, to work around this, videos are muxed on the server, which is intensive for the servers. If this could instead be done on the client in JS with this library, it would offer much better scaling for such services
I hope you don't take this as "he just wants to throw work at me" I am thankful you even looked at this <3
I created a mapping of all ebml tags according to the spec, then overwrote all the work I did a few days back. I want to punch something. Sorry, that js/txt file I sent isn't actually helpful
ebmltags.json here's the actual list of all matroska tags in a json file. Note that the id is a hex value in string form, so you'll need some parsing. These are taken from the official matroska documentation
do note that these overlap with the webm ones, so they need to be de-duped, which is what I had done before and is the progress I lost
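A rough sketch of the parsing and de-duping step could look like the following. The layout of ebmltags.json is an assumption here (an array of `{ name, id }` objects with `id` as a hex string), and `mergeTagIds` is a hypothetical helper, so adjust both to the real file.

```javascript
// Hypothetical helper: merge a list of { name, id } entries (id as a hex
// string, an ASSUMED layout for ebmltags.json) into an existing name -> id
// table, keeping the existing name whenever an id is already known.
function mergeTagIds (existing, entries) {
  const merged = new Map(Object.entries(existing).map(([name, id]) => [id, name]))
  for (const { name, id } of entries) {
    // parseInt with radix 16 accepts both "0x1A45DFA3" and "1A45DFA3".
    const numericId = parseInt(id, 16)
    if (Number.isNaN(numericId)) throw new TypeError('bad id for ' + name + ': ' + id)
    if (!merged.has(numericId)) merged.set(numericId, name)
  }
  return merged // Map of numeric id -> tag name
}

// Illustrative input only; the real lists are much longer.
const webmTags = { EBML: 0x1A45DFA3, Segment: 0x18538067 }
const matroskaTags = [
  { name: 'EBML', id: '0x1A45DFA3' },
  { name: 'FlagDefault', id: '0x88' }
]
console.log(mergeTagIds(webmTags, matroskaTags).size) // 3
```

Keying the map by numeric id rather than by name is a design choice: two lists can spell a tag name differently, but the id is what the decoder actually sees in the stream.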
running a file through the decoder and then the encoder again doesn't always yield the same result: sometimes it does, sometimes it doesn't
reproduction code:
yields different outputs, mainly different lengths. I've identified the cause to be master tags most of the time, specifically Segment and EBML
notably, from the media/ folder, audiosample.webm yields a different length after reencoding, while video-webm-codecs-avc1-42E01E.webm yields the same result
is this caused by corrupted blocks?