BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

download the training data #34

Closed UestcJay closed 4 months ago

UestcJay commented 5 months ago

Thanks for your great work! When I use the ModelScope Python API to download the training dataset, it fails:

>>> from modelscope.msdatasets import MsDataset
2024-03-20 14:56:10,539 - modelscope - INFO - PyTorch version 2.2.0+cu118 Found.
2024-03-20 14:56:10,542 - modelscope - INFO - Loading ast index from /mnt/afs1/likeqiang/.cache/modelscope/ast_indexer
2024-03-20 14:56:10,957 - modelscope - INFO - Loading done! Current index file version is 1.13.1, with md5 ac6c5f948b02361aa74e8bd58f64a6f7 and a total number of 972 components indexed
>>> ds =  MsDataset.load('BoyaWu10/Bunny-v1.0-data')
2024-03-20 14:56:21,614 - modelscope - INFO - No subset_name specified, defaulting to the default
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/afs1/likeqiang/miniconda3/envs/bunny/lib/python3.10/site-packages/modelscope/msdatasets/ms_dataset.py", line 284, in load
    dataset_inst = remote_dataloader_manager.load_dataset(
  File "/mnt/afs1/likeqiang/miniconda3/envs/bunny/lib/python3.10/site-packages/modelscope/msdatasets/data_loader/data_loader_manager.py", line 132, in load_dataset
    oss_downloader.process()
  File "/mnt/afs1/likeqiang/miniconda3/envs/bunny/lib/python3.10/site-packages/modelscope/msdatasets/data_loader/data_loader.py", line 83, in process
    self._prepare_and_download()
  File "/mnt/afs1/likeqiang/miniconda3/envs/bunny/lib/python3.10/site-packages/modelscope/msdatasets/data_loader/data_loader.py", line 132, in _prepare_and_download
    raise f'meta-file: {dataset_name}.py not found on the modelscope hub.'
TypeError: exceptions must derive from BaseException
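
A note on the traceback above: the `TypeError` hides the message the library was trying to report. In Python 3, `raise` only accepts exception classes or instances, so raising a plain f-string fails with `exceptions must derive from BaseException` before the intended text ("meta-file ... not found on the modelscope hub") is ever shown. The sketch below reproduces this in isolation. The function names and the choice of `FileNotFoundError` are illustrative assumptions, not ModelScope code.

```python
def broken_raise(dataset_name: str) -> None:
    # Mirrors the pattern in the traceback: a str is raised directly,
    # which Python 3 rejects with a TypeError.
    raise f'meta-file: {dataset_name}.py not found on the modelscope hub.'


def fixed_raise(dataset_name: str) -> None:
    # Hypothetical fix: wrap the message in an exception class so the
    # real cause reaches the user.
    raise FileNotFoundError(
        f'meta-file: {dataset_name}.py not found on the modelscope hub.'
    )


def error_of(func, name: str):
    """Call func(name) and return (exception type name, message)."""
    try:
        func(name)
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


if __name__ == "__main__":
    print(error_of(broken_raise, "example"))
    print(error_of(fixed_raise, "example"))
```

Read this way, the underlying problem is likely the missing dataset meta-file rather than anything in the local environment.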

When I use git clone directly, it shows:

Cloning into 'Bunny-v1.0-data'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 50 (delta 17), reused 43 (delta 13), pack-reused 0
Unpacking objects: 100% (50/50), 6.23 KiB | 25.00 KiB/s, done.
Filtering content: 100% (11/11), 18.76 GiB | 5.17 MiB/s, done.
Encountered 9 files that may not have been copied correctly on Windows:
        finetune/images.tar.gz.part-ad
        pretrain/images.tar.gz.part-aa
        finetune/images.tar.gz.part-ac
        finetune/images.tar.gz.part-ab
        pretrain/images.tar.gz.part-ae
        pretrain/images.tar.gz.part-ac
        pretrain/images.tar.gz.part-ab
        pretrain/images.tar.gz.part-ad
        finetune/images.tar.gz.part-aa

Could you give me some advice? Or could you upload the dataset to Hugging Face?

BoyaWu10 commented 5 months ago

Hi @UestcJay, thanks for sharing this. The git log looks fine to me. After downloading all the part files, you'll need to run the following command in the pretrain folder and the finetune folder respectively, to combine the image packages into one:

cat images.tar.gz.part-* > images.tar.gz

This is because we split the images into multiple packages to make the uploading process more stable.
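
For environments without `cat` (for example a plain Windows shell), the same concatenation can be sketched in Python. This is a minimal sketch, assuming the part files sit directly in the folder and that lexicographic order of the suffixes (`part-aa`, `part-ab`, ...) is the intended order, which is the order the shell glob above would use. The function name `combine_parts` is hypothetical.

```python
import shutil
from pathlib import Path


def combine_parts(folder, output_name="images.tar.gz"):
    """Concatenate <output_name>.part-* files in `folder` into one file."""
    folder = Path(folder)
    # Sort so that part-aa, part-ab, ... are appended in order.
    parts = sorted(folder.glob(output_name + ".part-*"))
    if not parts:
        raise FileNotFoundError(f"no {output_name}.part-* files in {folder}")
    target = folder / output_name
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are not loaded into memory.
                shutil.copyfileobj(src, out)
    return target


if __name__ == "__main__":
    for name in ("pretrain", "finetune"):
        if Path(name).is_dir():
            print(combine_parts(name))
```

Run it from the dataset root so that both `pretrain` and `finetune` are visited, then extract each resulting `images.tar.gz` as usual.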

UestcJay commented 5 months ago

Hi, I reproduced the training using the Bunny dataset, but when I evaluate on MME the model's response does not start with yes or no. Does your model behave like this? How can I evaluate on MME?

question:

Is this artwork created by gentile da fabriano? Please answer yes or no.

warning: Setting pad_token_id to eos_token_id:50256 for open-end generation.

response:

the artwork in question is not created by gentile da fabriano gentile da fabriano was an italian painter active in the early renaissance, known for his work in florence the style of the painting, with its gold leaf background and the particular rendering of the figures, is more indicative of the work of artists from the late gothic period, such as fra angelico or giotto, who were active in the early 15th century the use of gold leaf and the specific iconography of the virgin mary and child are also more characteristic of the early renaissance, which followed the gothic period therefore, the correct answer to the question is  no, this artwork is not created by gentile da fabriano

Isaachhh commented 5 months ago

Evaluation

Please share more information.

And the warning shouldn't occur due to here. Please check your code version.

UestcJay commented 5 months ago

Okay, I probably forgot to add this line of code. I used the Bunny dataset for two stages: pretraining and full-parameter SFT. The pretraining stage froze the ViT and the LLM, and the SFT stage froze the ViT. The training strategies are the same as yours. However, when I evaluated on MME, there were problems like the example above: the model I trained does not start its answers with yes or no, and the perception score is 1250. This should be abnormal, right? I don't know which step has the problem. Should all answers of your models start with yes or no?

Isaachhh commented 5 months ago

Please use our code to train and evaluate the models.

"Should all answers of your models start with yes or no?" Yes.
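
As a quick sanity check before scoring, one could count how many responses actually begin with "yes" or "no". The helper below is a hypothetical sketch for that purpose only. It is not the MME scoring script or part of the Bunny code base, and the function name `leading_yes_no` is an assumption.

```python
import re


def leading_yes_no(response: str):
    """Return 'yes' or 'no' if the response starts with that word, else None."""
    # \b prevents matching longer words such as "not" or "nothing".
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    samples = [
        "No, this artwork is not created by gentile da fabriano",
        "the artwork in question is not created by gentile da fabriano",
    ]
    for text in samples:
        print(leading_yes_no(text), "<-", text[:40])
```

A large share of `None` results, as with the long response quoted earlier in this thread, suggests the model is not following the expected answer format.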

Isaachhh commented 4 months ago

Closing the issue for now since there is no further discussion. Feel free to reopen it if there are any other questions.