SamsungLabs / StepFormer

How to make shard files of the Howto100M dataset? #4

Open Y-Haneji opened 6 months ago

Y-Haneji commented 6 months ago

@hadjisma @lavenderrz Your code mentions a directory containing shard files formatted as tar. https://github.com/SamsungLabs/StepFormer/blob/31f62679536177e7bc8e132b5611ee596f427fab/data/tar_loader.py#L35

[Question] Could you share code to reproduce them, or at least the key-value pairs that each sample should contain?

lavenderrz commented 6 months ago

Hi, the video features in these tar files are MIL-NCE features, extracted following https://github.com/ArrowLuo/VideoFeatureExtractor and then processed with the UniVL model.

Y-Haneji commented 6 months ago

Thank you for replying! I'll try it.

Y-Haneji commented 5 months ago

@lavenderrz You split the train/val by shard files. https://github.com/SamsungLabs/StepFormer/blob/31f62679536177e7bc8e132b5611ee596f427fab/data/tar_loader.py#L38

What is the unit of a shard file? Does each shard correspond to ONE video, as with wds.ShardWriter(f'shards-{video_id}.tar'), or to SEVERAL videos, as with wds.ShardWriter('shards-%05d.tar', maxsize=int(50 * 1000**2)) (a 50 MB limit)?
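To illustrate why the unit matters, here is a minimal sketch of a shard-level train/val split. It assumes shards are named by a running index; `split_shards`, `val_fraction`, and `seed` are hypothetical names and are not taken from tar_loader.py.

```python
import random


def split_shards(shard_paths, val_fraction=0.1, seed=0):
    """Split shard paths into (train, val) lists at the shard level.

    Hypothetical helper: the repository's own split may work differently.
    """
    paths = sorted(shard_paths)
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(paths)
    # Hold out at least one shard for validation.
    n_val = max(1, int(len(paths) * val_fraction))
    return paths[n_val:], paths[:n_val]


all_shards = [f"shard-{i:06d}.tar" for i in range(10)]
train_shards, val_shards = split_shards(all_shards)
```

If each shard held exactly one video, this would also be a split by video. With multi-video shards, whole groups of videos end up on the same side of the split.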

HankKung commented 2 months ago

@Y-Haneji Hi, have you tried extracting features with UniVL? Could you share the script for that? It would help a lot, thank you!

Y-Haneji commented 1 month ago

@HankKung No, I haven't. I used a different encoder, and I can't share the full code because it is part of ongoing research. Below is some pseudo-code that I hope helps. Please direct further questions to the authors.

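import webdataset as wds

# Pseudo-code: `videos` is a placeholder for an iterable of per-video records,
# assumed here to be dicts that already hold the extracted features.
# Each shard holds several videos; a new shard starts once maxsize is reached.
with wds.ShardWriter("shard-%06d.tar", maxsize=int(5e8)) as sink:  # ~500 MB per shard
    for video in videos:
        sample = {
            "__key__": video["name"],  # unique sample key, e.g. the video ID
            "pickle": {  # written to the tar as a pickled dict
                "video_features": video["video_features"],
                "text_features": video["text_features"],
                "json": video["annotations"],
                "name": video["name"],
            },
        }
        sink.write(sample)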
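
For anyone who wants to inspect such files without installing webdataset: the shards are ordinary tar archives in which each sample is stored as one or more members named `<key>.<extension>`. The sketch below imitates that layout with the standard library only, using one `<key>.pickle` member per video. It is not webdataset's implementation or the authors' script; `write_shard` and `read_shard` are hypothetical helpers, the feature values are dummy placeholders, and keys are assumed to contain no dots.

```python
import io
import os
import pickle
import tarfile
import tempfile


def write_shard(path, samples):
    """Write (key, payload) pairs to a tar file, one '<key>.pickle' member each."""
    with tarfile.open(path, "w") as tar:
        for key, payload in samples:
            data = pickle.dumps(payload)
            info = tarfile.TarInfo(name=f"{key}.pickle")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_shard(path):
    """Yield (key, payload) pairs from a tar file written by write_shard."""
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: the part before it is the sample key.
            key, _, ext = member.name.partition(".")
            if ext == "pickle":
                yield key, pickle.loads(tar.extractfile(member).read())


# Dummy samples standing in for real per-video features.
samples = [
    ("video_a", {"video_features": [[0.1, 0.2]], "name": "video_a"}),
    ("video_b", {"video_features": [[0.3, 0.4]], "name": "video_b"}),
]

with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "shard-000000.tar")
    write_shard(shard_path, samples)
    restored = list(read_shard(shard_path))
```

Only unpickle shards you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.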