Closed: LinB203 closed this issue 7 months ago
The E4D size is incorrect. I think it's because we have four parts and may have only uploaded the first one. Let me prepare and upload the remaining parts accordingly. But it may take a few days since they are pretty large.
Thanks for your patient reply.
It would be much appreciated if it could be uploaded to Hugging Face; the OneDrive link is too unstable.
Sure, let me do so. But uploading to HF is also troublesome; sometimes the git process just fails when a file exceeds 100 GB...
I think splitting into multiple parts, like DC does, is a good solution! Thanks again for your generous contribution.
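For anyone hitting the same >100 GB limit: below is a minimal, hedged sketch of one way to split a large file into fixed-size parts before upload and reassemble it afterwards. The file names and part-naming scheme here are made up for illustration; this is not the exact script used for DC/E4D.

```python
# Sketch: split an oversized file into fixed-size parts (so no single
# part exceeds git/HF size limits), then reassemble. Names are made up.
import os

def split_file(path, part_size):
    """Write path's bytes into sequential .partNN files; return their paths."""
    parts = []
    with open(path, "rb") as f:
        idx = 0
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_path = f"{path}.part{idx:02d}"
            with open(part_path, "wb") as out:
                out.write(chunk)
            parts.append(part_path)
            idx += 1
    return parts

def join_files(parts, out_path):
    """Concatenate the parts back into a single file."""
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                out.write(f.read())

# Round-trip demo on a small payload: 1000 bytes split at 300 -> 4 parts.
with open("big.bin", "wb") as f:
    f.write(b"x" * 1000)
parts = split_file("big.bin", 300)
join_files(parts, "restored.bin")
print(len(parts), os.path.getsize("restored.bin"))  # 4 1000
```

The same idea applies whether the parts are raw byte chunks (as here) or separate parquet shards like the DC split.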
Hello! We are now uploading DC and E4D to the huggingface dataset; it is expected to be completed within 24 hours.
You can check this code to learn how to load the dataset (instruction.json + images.parquet) and verify the integrity of the whole dataset.
Thank you!
That's awesome!
E4D is updated here: https://huggingface.co/datasets/pufanyi/MIMICIT/tree/main/data/E4D
Thanks for your great contribution.
It seems that VST is also mismatched, although VST parquets were updated 7 days ago.
oh? Doesn't it have 32,893 instruction pairs?
I used this code to validate it. The VST parquet is incomplete.
Indeed 32,893 samples, but 9,027 missing_image_ids.
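The check described above (samples present, but some referenced images absent) can be sketched roughly like this: collect every image id the instructions reference and diff that set against the ids actually present in the images parquet. The field names (`image_ids`, the id list) are assumptions for illustration, not the exact MIMIC-IT schema.

```python
# Sketch of the integrity check: which image_ids are referenced by the
# instructions but absent from the images parquet? Field names assumed.
def find_missing_image_ids(instructions, available_ids):
    referenced = set()
    for sample in instructions.values():
        referenced.update(sample["image_ids"])
    return referenced - set(available_ids)

# Toy example: 3 samples reference 4 ids; 1 id is absent from the parquet.
instructions = {
    "VST_INS_0": {"image_ids": ["VST_IMG_0", "VST_IMG_1"]},
    "VST_INS_1": {"image_ids": ["VST_IMG_1"]},
    "VST_INS_2": {"image_ids": ["VST_IMG_2", "VST_IMG_3"]},
}
available = ["VST_IMG_0", "VST_IMG_1", "VST_IMG_2"]
print(sorted(find_missing_image_ids(instructions, available)))  # ['VST_IMG_3']
```

With the real data, `available_ids` would come from the parquet's id column and `instructions` from `x_instruction.json`.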
By the way, some sizes seem mismatched with those in the paper, which makes it inconvenient to check the integrity of the whole dataset.
The size mismatch may come from our iteratively cleaning the dataset after submission. We will update the paper once the numbers are fully confirmed.
Let me check the VST's missing image_ids then.
okk thank you very much
I'll also check my downloaded files again!
I've checked E4D, CGD, SD, SN, and TVC, and I'm re-downloading VST and LA to check them again. LA fails to load.
@Luodian Hi, loading the LA parquet raises an error. I've downloaded it twice.
I've encountered missing files as well, but as far as I can remember there seemed to be only a few (maybe a few thousand), so I ignored them.
Please take a look at this issue; loading iteratively should address it.
Which issue?
I used this:

```python
import logging

import dask.dataframe as dd
import pandas as pd

logger = logging.getLogger(__name__)

def load_parquet_file(file_path_list):
    """Load each parquet file separately, then concatenate the results."""
    image_dfs = []
    for file_path in file_path_list:
        logger.info(f"Loading parquet file: {file_path}")
        # .compute() materializes the lazy dask frame into a pandas DataFrame.
        image_df = dd.read_parquet(file_path, engine="pyarrow").compute()
        image_dfs.append(image_df)
    logger.info(f"Concatenating parquet files: {file_path_list}")
    return pd.concat(image_dfs)
```
This: #320
okkkk, i see
@Luodian Thanks very much, LA is okay now. Only VST still lacks some samples.
Again, thanks for your brilliant work, the repo, and your help, which has taught me a lot!
As mentioned in the paper, the MIMIC-IT dataset has 2.2M instruction QA pairs. But I have downloaded every x_instruction.json from Hugging Face, and the total number of instruction QA pairs is 1,171k. Am I missing anything?
VST 32k image-qa, LA 256k image-qa, SN 6k image-qa, SD 16k image-qa, CGD 141k image-qa, E4D 527k video-qa, DC 56k video-qa, TVC 137k video-qa.
In a word, 451k image-qa and 720k video-qa, which is 1,171k QA pairs in total.
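The tally above can be double-checked in a couple of lines (counts in thousands, exactly as listed):

```python
# Sum the per-subset instruction counts (in thousands) by modality.
image_qa = {"VST": 32, "LA": 256, "SN": 6, "SD": 16, "CGD": 141}
video_qa = {"E4D": 527, "DC": 56, "TVC": 137}

image_total = sum(image_qa.values())
video_total = sum(video_qa.values())
print(image_total, video_total, image_total + video_total)  # 451 720 1171
```

So the listed subsets do sum to 1,171k, still well short of the 2.2M in the paper.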