clinicalml / TabLLM

MIT License

Can TabLLM run on Windows? #26

Closed yitongshang2021 closed 1 month ago

yitongshang2021 commented 1 month ago

Dear author, when I run the t-few code, dev_scores.json and the other output files are not generated. Can TabLLM only run on Linux? Thanks!

stefanhgm commented 1 month ago

Hello @yitongshang2021,

Thanks for using TabLLM! I only tested the environment on Linux, but I am pretty sure it should also run on Windows. It would help if you could describe the error in more detail. Does the entire program run without errors, with only the mentioned files missing?

Another option might be to use a Colab notebook or other online instance that runs Linux.

Hope that helps!

Stefan

yitongshang2021 commented 1 month ago

Hello @stefanhgm, thanks for your message. In the tfew environment, deepspeed==0.5.10 is hard to install on Windows, so I used deepspeed==0.3.16 instead, which installed successfully.

However, when I run the following commands, I get the error below. Can you help me?

(tfew) C:\Users\78166\t-few>set HF_HOME=%USERPROFILE%.cache\huggingface

(tfew) C:\Users\78166\t-few>set CUDA_VISIBLE_DEVICES=0

(tfew) C:\Users\78166\t-few>python -m src.pl_train -c t03b.json+rte.json -k save_model=False exp_name=first_exp
Start experiment first_exp
{ "exp_dir": "exp_out\first_exp", "exp_name": "first_exp", "allow_skip_exp": true, "seed": 42, "model": "EncDec", "max_seq_len": 256, "origin_model": "bigscience/T0_3B", "load_weight": "", "dataset": "rte", "few_shot": true, "num_shot": 32, "few_shot_random_seed": 100, "train_template_idx": -1, "eval_template_idx": -1, "batch_size": 8, "eval_batch_size": 16, "num_workers": 8, "change_hswag_templates": false, "raft_cross_validation": true, "raft_validation_start": 0, "raft_labels_in_input_string": "comma", "cleaned_answer_choices_b77": false, "compute_precision": "bf16", "compute_strategy": "none", "num_steps": 300, "eval_epoch_interval": 10000, "eval_before_training": true, "save_model": false, "save_step_interval": 20000, "mc_loss": 1, "unlikely_loss": 1, "length_norm": 1, "grad_accum_factor": 1, "split_option_at_inference": false, "optimizer": "adafactor", "lr": 0.0003, "trainable_param_names": ".*", "scheduler": "linear_decay_with_warmup", "warmup_ratio": 0.06, "weight_decay": 0, "scale_parameter": true, "grad_clip_norm": 1, "model_modifier": "", "prompt_tuning_num_prefix_emb": 100, "prompt_tuning_encoder": true, "prompt_tuning_decoder": true, "lora_rank": 4, "lora_scaling_rank": 0, "lora_init_scale": 0.01, "lora_modules": "none", "lora_layers": "none", "bitfit_modules": ".*", "bitfit_layers": "q|k|v|o|wi_[01]|w_o", "adapter_type": "normal", "adapter_non_linearity": "relu", "adapter_reduction_factor": 4, "normal_adapter_residual": true, "lowrank_adapter_w_init": "glorot-uniform", "lowrank_adapter_rank": 1, "compacter_hypercomplex_division": 8, "compacter_learn_phm": true, "compacter_hypercomplex_nonlinearity": "glorot-uniform", "compacter_shared_phm_rule": false, "compacter_factorized_phm": false, "compacter_shared_W_phm": false, "compacter_factorized_phm_rule": false, "compacter_phm_c_init": "normal", "compacter_phm_rank": 1, "compacter_phm_init_range": 0.01, "compacter_kronecker_prod": false, "compacter_add_compacter_in_self_attention": false, "compacter_add_compacter_in_cross_attention": false, "intrinsic_projection": "fastfood", "intrinsic_said": true, "intrinsic_dim": 2000, "intrinsic_device": "cpu", "fishmask_mode": null, "fishmask_path": null, "fishmask_keep_ratio": 0.05, "prefix_tuning_num_input_tokens": 10, "prefix_tuning_num_target_tokens": 10, "prefix_tuning_init_path": null, "prefix_tuning_init_text": null, "prefix_tuning_parameterization": "mlp-512", "train_pred_file": "exp_out\first_exp\train_pred.txt", "dev_pred_file": "exp_out\first_exp\dev_pred.txt", "dev_score_file": "exp_out\first_exp\dev_scores.json", "test_pred_file": "exp_out\first_exp\test_pred.txt", "test_score_file": "exp_out\first_exp\test_scores.json", "finish_flag_file": "exp_out\first_exp\exp_completed.txt" }
Mark experiment first_exp as claimed
Traceback (most recent call last):
  File "C:\Users\78166\anaconda3\envs\tfew\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\78166\t-few\src\pl_train.py", line 86, in <module>
    main(config)
  File "C:\Users\78166\t-few\src\pl_train.py", line 33, in main
    tokenizer, model = get_transformer(config)
  File "C:\Users\78166\t-few\src\pl_train.py", line 17, in get_transformer
    tokenizer = AutoTokenizer.from_pretrained(config.origin_model)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 481, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 350, in get_tokenizer_config
    use_auth_token=use_auth_token,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\file_utils.py", line 1784, in cached_path
    local_files_only=local_files_only,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\transformers\file_utils.py", line 1947, in get_from_cache
    r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\api.py", line 100, in head
    return request("head", url, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\requests\adapters.py", line 497, in send
    chunked=chunked,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connectionpool.py", line 696, in urlopen
    self._prepare_proxy(conn)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connectionpool.py", line 964, in _prepare_proxy
    conn.connect()
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connection.py", line 359, in connect
    conn = self._connect_tls_proxy(hostname, conn)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\connection.py", line 506, in _connect_tls_proxy
    ssl_context=ssl_context,
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\util\ssl_.py", line 453, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\urllib3\util\ssl_.py", line 495, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\ssl.py", line 412, in wrap_socket
    session=session
  File "C:\Users\78166\anaconda3\envs\tfew\lib\ssl.py", line 807, in _create
    raise ValueError("check_hostname requires server_hostname")
ValueError: check_hostname requires server_hostname

stefanhgm commented 1 month ago

Hi @yitongshang2021,

this looks as if the program already fails while downloading the huggingface model (see line tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)).

Unfortunately, this could have a variety of causes. You could try running a minimal Hugging Face example first, or search online for the error ValueError: check_hostname requires server_hostname. Maybe this helps: https://stackoverflow.com/questions/67297278/valueerror-check-hostname-requires-server-hostname ?
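A minimal check along these lines could look like the sketch below. It rests on an assumption the thread does not confirm: the traceback passes through `_prepare_proxy` and `_connect_tls_proxy`, which suggests the request is being routed through a proxy that urllib3 treats as an HTTPS proxy, a commonly reported trigger for this error on Windows. The helper names `find_tls_proxies` and `try_tokenizer_download` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit
from urllib.request import getproxies


def find_tls_proxies(proxies):
    """Return (key, url) pairs whose proxy URL uses the https:// scheme.

    Assumption: a proxy URL starting with https:// makes urllib3 open a TLS
    connection to the proxy itself, which can fail if the proxy only speaks
    plain HTTP.
    """
    return [(key, url) for key, url in proxies.items()
            if urlsplit(url).scheme == "https"]


def try_tokenizer_download(model_name="bigscience/T0_3B"):
    """Minimal Hugging Face call, isolated from the t-few training code.

    Not called automatically, because it needs network access and the
    transformers package.
    """
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(model_name)


# getproxies() reports the proxies Python picks up from environment
# variables (and, on Windows, from the system proxy settings).
for key, url in find_tls_proxies(getproxies()):
    print(f"proxy for {key!r} uses https:// scheme: {url}")
```

If the loop prints anything, setting the proxy explicitly with an `http://` scheme before calling `try_tokenizer_download()` is one thing to try; whether that applies here depends on the local network setup.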

Hope that helps!

yitongshang2021 commented 1 month ago

Hi @stefanhgm! Thanks again for your message. With the help of your information, I have solved the above bug. However, a new error has appeared, shown below. I would really appreciate it if you could give me some insight.

(tfew) C:\Users\78166\t-few-master>python -m src.pl_train -c t03b.json+rte.json -k save_model=False exp_name=first_exp
Start experiment first_exp
{ "exp_dir": "exp_out\first_exp", "exp_name": "first_exp", "allow_skip_exp": true, "seed": 42, "model": "EncDec", "max_seq_len": 256, "origin_model": "bigscience/T0_3B", "load_weight": "", "dataset": "rte", "few_shot": true, "num_shot": 32, "few_shot_random_seed": 100, "train_template_idx": -1, "eval_template_idx": -1, "batch_size": 8, "eval_batch_size": 16, "num_workers": 8, "change_hswag_templates": false, "raft_cross_validation": true, "raft_validation_start": 0, "raft_labels_in_input_string": "comma", "cleaned_answer_choices_b77": false, "compute_precision": "bf16", "compute_strategy": "none", "num_steps": 300, "eval_epoch_interval": 10000, "eval_before_training": true, "save_model": false, "save_step_interval": 20000, "mc_loss": 1, "unlikely_loss": 1, "length_norm": 1, "grad_accum_factor": 1, "split_option_at_inference": false, "optimizer": "adafactor", "lr": 0.0003, "trainable_param_names": ".*", "scheduler": "linear_decay_with_warmup", "warmup_ratio": 0.06, "weight_decay": 0, "scale_parameter": true, "grad_clip_norm": 1, "model_modifier": "", "prompt_tuning_num_prefix_emb": 100, "prompt_tuning_encoder": true, "prompt_tuning_decoder": true, "lora_rank": 4, "lora_scaling_rank": 0, "lora_init_scale": 0.01, "lora_modules": "none", "lora_layers": "none", "bitfit_modules": ".*", "bitfit_layers": "q|k|v|o|wi_[01]|w_o", "adapter_type": "normal", "adapter_non_linearity": "relu", "adapter_reduction_factor": 4, "normal_adapter_residual": true, "lowrank_adapter_w_init": "glorot-uniform", "lowrank_adapter_rank": 1, "compacter_hypercomplex_division": 8, "compacter_learn_phm": true, "compacter_hypercomplex_nonlinearity": "glorot-uniform", "compacter_shared_phm_rule": false, "compacter_factorized_phm": false, "compacter_shared_W_phm": false, "compacter_factorized_phm_rule": false, "compacter_phm_c_init": "normal", "compacter_phm_rank": 1, "compacter_phm_init_range": 0.01, "compacter_kronecker_prod": false, "compacter_add_compacter_in_self_attention": false, "compacter_add_compacter_in_cross_attention": false, "intrinsic_projection": "fastfood", "intrinsic_said": true, "intrinsic_dim": 2000, "intrinsic_device": "cpu", "fishmask_mode": null, "fishmask_path": null, "fishmask_keep_ratio": 0.05, "prefix_tuning_num_input_tokens": 10, "prefix_tuning_num_target_tokens": 10, "prefix_tuning_init_path": null, "prefix_tuning_init_text": null, "prefix_tuning_parameterization": "mlp-512", "train_pred_file": "exp_out\first_exp\train_pred.txt", "dev_pred_file": "exp_out\first_exp\dev_pred.txt", "dev_score_file": "exp_out\first_exp\dev_scores.json", "test_pred_file": "exp_out\first_exp\test_pred.txt", "test_score_file": "exp_out\first_exp\test_scores.json", "finish_flag_file": "exp_out\first_exp\exp_completed.txt" }
Mark experiment first_exp as claimed
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Reusing dataset super_glue (C:\Users\78166.cache\huggingface\super_glue\rte\1.0.2\d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7)
Train size 32
Eval size 277
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\utilities\model_summary.py:127: RuntimeWarning: overflow encountered in long_scalars
  return sum(np.prod(p.shape) if not _is_lazy_weight_tensor(p) else 0 for p in self._module.parameters())
Traceback (most recent call last):
  File "C:\Users\78166\anaconda3\envs\tfew\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\78166\t-few-master\src\pl_train.py", line 89, in <module>
    main(config)
  File "C:\Users\78166\t-few-master\src\pl_train.py", line 60, in main
    trainer.fit(model, datamodule)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1199, in _run
    self._dispatch()
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1289, in run_stage
    return self._run_train()
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1306, in _run_train
    self._pre_training_routine()
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1301, in _pre_training_routine
    self.call_hook("on_pretrain_routine_start")
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1495, in call_hook
    callback_fx(*args, **kwargs)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 148, in on_pretrain_routine_start
    callback.on_pretrain_routine_start(self, self.lightning_module)
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\callbacks\model_summary.py", line 57, in on_pretrain_routine_start
    summary_data = model_summary._get_summary_data()
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\utilities\model_summary.py", line 316, in _get_summary_data
    ("Params", list(map(get_human_readable_count, self.param_nums))),
  File "C:\Users\78166\anaconda3\envs\tfew\lib\site-packages\pytorch_lightning\utilities\model_summary.py", line 419, in get_human_readable_count
    assert number >= 0
AssertionError
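One plausible reading of this log, offered as an assumption rather than a confirmed diagnosis: the `RuntimeWarning: overflow encountered in long_scalars` comes from the parameter count in the model summary, and the `assert number >= 0` then fails because the count has wrapped to a negative value. NumPy 1.x uses a 32-bit default integer on Windows, and a model of about 3 billion parameters (as the name T0_3B suggests) exceeds the 32-bit limit of 2,147,483,647. The sketch below emulates that wraparound with `ctypes.c_int32`; the tensor shapes are made up for illustration and are not the real T0_3B shapes.

```python
import ctypes
import math


def total_params(shapes):
    """Exact parameter count using Python's arbitrary-precision integers."""
    return sum(math.prod(shape) for shape in shapes)


def total_params_int32(shapes):
    """Same count, but the running total is truncated to a signed 32-bit
    integer after every addition, emulating a 32-bit accumulator."""
    total = 0
    for shape in shapes:
        total = ctypes.c_int32(total + math.prod(shape)).value
    return total


# Hypothetical shapes: three tensors of 2**30 elements each (about 3.2e9 total).
shapes = [(32768, 32768)] * 3

exact = total_params(shapes)
wrapped = total_params_int32(shapes)
print(exact)    # 3221225472
print(wrapped)  # -1073741824, which would fail an `assert number >= 0`
```

If this is indeed the cause, the failure sits in the summary printout rather than in training itself. Disabling the summary (for example with the `Trainer` argument `enable_model_summary=False`, assuming the installed PyTorch Lightning version supports it) may get past this point, though the thread does not verify that.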

yitongshang2021 commented 1 month ago

Hi @stefanhgm, I have now run the demo on Linux. The code is very helpful! In my experience, it is difficult to get it running on Windows, mainly because of compatibility issues with some libraries. Thank you for your help.