HaozheZhao / MIC

MMICL, a state-of-the-art VLM with in-context learning ability, from PKU

Organization of the original data #2

Closed yytzsy closed 1 year ago

yytzsy commented 1 year ago

How should the data from the public datasets you used be organized?

HaozheZhao commented 1 year ago

We convert all tasks into a QA-style format. For detailed information about the format of our MIC dataset, please refer to our dataset released on the huggingface hub.
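For reference, a minimal sketch of loading the dataset from the Hugging Face Hub and inspecting a record. The dataset ID and field names below are placeholders, not the actual MIC schema; check the dataset card on the hub for the real path and columns.

```python
# Sketch only: dataset ID and field names are assumptions, not the real MIC schema.
from datasets import load_dataset

# Hypothetical dataset ID; replace with the path published on the hub.
ds = load_dataset("HaozheZhao/MIC", split="train", streaming=True)

# A QA-style record might look roughly like this (field names are guesses):
example = {
    "input_text": "Question: What is the man in image 0 holding? Answer:",
    "input_image": "<base64-encoded image string>",
    "output_text": "A surfboard.",
}

# Inspect the real schema of the first record.
for record in ds:
    print(record.keys())
    break
```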

yytzsy commented 1 year ago

Thank you! For the image data that comes from public datasets, could you provide the original links you downloaded from, so that we can reproduce your setup?

HaozheZhao commented 1 year ago

Unfortunately, we are still working on the license issue for releasing the image data publicly. In addition, the dataset currently contains approximately 2M images and video files, and we have run into some problems uploading the data (2M files, ~300 GB) to the huggingface hub. We are working on resolving this and expect to have it done in the coming weeks. If you want to take a look at the training data, you can try our MIC_Sample data, which provides a portion of the training data with the image and video data converted into base64-encoded strings.
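A minimal sketch for turning the base64-encoded image strings in the sample back into usable images, assuming Pillow is installed; the record key used here is hypothetical.

```python
# Sketch: decode a base64-encoded image string from the MIC_Sample data.
import base64
import io

from PIL import Image


def decode_b64_image(b64_string: str) -> Image.Image:
    """Decode a base64-encoded image string back into a PIL image."""
    image_bytes = base64.b64decode(b64_string)
    return Image.open(io.BytesIO(image_bytes)).convert("RGB")


# Usage with a hypothetical record field name:
# img = decode_b64_image(record["input_image"])
# img.save("example.jpg")
```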

yytzsy commented 1 year ago

Got it. If uploading the data directly is difficult, could you instead provide links to the original datasets, i.e., the official download links for the datasets you used (where you downloaded these images from), such as COCO2014? Thank you very much for the prompt reply!

HaozheZhao commented 1 year ago

We have considered this approach and are currently organizing the data we used and its sources. We expect to release it in about a week as a repository for dataset construction, which will include links to the data sources and instructions on how to use the repo to easily extend the existing open-source datasets into the MIC dataset.

yytzsy commented 1 year ago

Great!

HaozheZhao commented 1 year ago

: )

yotofu commented 1 year ago

Please hurry, thanks a lot~

HaozheZhao commented 1 year ago

You can check out the MIC_tool repo. It can be used to transform existing open-source datasets into the MIC dataset.

JiazhengChai commented 11 months ago

@HaozheZhao Hi there, thank you for uploading the dataset to the huggingface hub. As mentioned above, what is the license of the dataset on huggingface? Can you explain why it is currently listed as "unknown"? Thank you.

HaozheZhao commented 11 months ago

Hi, there is no doubt that we strongly respect the efforts of others in creating open-source data, and data licensing is undoubtedly important. Because the MIC dataset is constructed from open-source data, it involves complex licensing relationships. We are still compiling all the licenses of the source datasets and images in order to provide a license for our dataset, so the license of the MIC dataset is temporarily listed as unknown. If you find that our dataset violates any licensing terms, we will correct it as soon as possible.