question about data preprocessing

IBM / multidoc2dial

MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents

Apache License 2.0


question about data preprocessing #14

Open gouqi666 opened 2 years ago

gouqi666 commented 2 years ago

Hi, when I try to run your code, I can't download the datasets with load_dataset. The error is "HF google storage unreachable. Downloading and preparing it from source". I am using a VPN, but the problem is still there. So I want to download the data manually, but I found that some fields in the data are mismatched. Could you help me? Thanks.

songfeng commented 2 years ago

Thank you for the question! If you download the data manually and save it locally, you can specify DATA_DIR here, and you should be fine. There is also sample code for loading the dataset here.
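For anyone who downloads the files manually and wants to see which fields they contain before pointing DATA_DIR at them, a minimal sketch in plain Python follows. It only uses the standard library and does not call load_dataset. The directory path and the helper names `summarize_json` and `summarize_dir` are hypothetical, chosen for illustration; they are not part of the multidoc2dial repository.

```python
import json
from pathlib import Path

# Assumption: this is wherever the manually downloaded files were saved.
# The path is hypothetical; replace it with your own DATA_DIR.
DATA_DIR = Path("data/multidoc2dial")


def summarize_json(path):
    """Load one JSON file and map each top-level key to the type of its value."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Files whose root is a list (or a scalar) have no keys to report.
    return {"<root>": type(data).__name__}


def summarize_dir(data_dir):
    """Summarize every *.json file found directly under data_dir."""
    return {
        path.name: summarize_json(path)
        for path in sorted(Path(data_dir).glob("*.json"))
    }
```

Calling `summarize_dir(DATA_DIR)` returns one entry per JSON file, which can be compared against the fields the loading code expects when a mismatch is suspected.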

We recently updated the test split of the MultiDoc2Dial dataset, which caused some issues with Hugging Face load_dataset. @sivasankalpp Could you please follow up on this? Thank you very much!

sivasankalpp commented 2 years ago

Hi @gouqi666, happy to help! Can you share the command you tried to download the dataset?

songfeng commented 2 years ago

> Hi @gouqi666, happy to help! Can you share the command you tried to download the dataset?

Hi @sivasankalpp, I meant the Hugging Face multidoc2dial dataset. I think we need to run their command to update dataset_infos.json and then test whether load_dataset works with the latest multidoc2dial download. This should resolve the data loading issue reported here.

gouqi666 commented 2 years ago

Also, in model_convert.py, the line retriever = RagRetriever(model.config, question_encoder_tokenizer, generator_tokenizer) raises ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.16.1/datasets/wiki_dpr/wiki_dpr.py.

songfeng commented 2 years ago

The link works for me.
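If the link opens from one machine but not another, the failure is probably on the network side of the machine running the script. A rough way to check this from the same Python environment is sketched below, using only the standard library. The helper name `is_reachable` is hypothetical, and a successful check here does not guarantee that RagRetriever will succeed, since it may use different proxy or timeout settings.

```python
import http.client
import urllib.error
import urllib.request

# The URL reported in the ConnectionError above.
WIKI_DPR_SCRIPT_URL = (
    "https://raw.githubusercontent.com/huggingface/datasets/"
    "1.16.1/datasets/wiki_dpr/wiki_dpr.py"
)


def is_reachable(url, timeout=10.0):
    """Return True if the URL answers with a 2xx/3xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, HTTP errors,
        # and malformed URLs.
        return False
```

Running `is_reachable(WIKI_DPR_SCRIPT_URL)` on the affected machine shows whether that environment can fetch the script at all; if it returns False, the VPN or proxy configuration is the first thing to look at.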