aehrc / cxrmate

CXRMate: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation
https://huggingface.co/aehrc/cxrmate
Apache License 2.0

Unknown Error in Train #17

Open oscarloch opened 1 month ago

oscarloch commented 1 month ago

Hi, I was trying to run the training and I got this error.
Have you run into this before? Do you know what might be the source of the error?

(cxrmate_env) oscarloch@cluster:~/cxr_models/cxrmate$ dlhpcstarter -t cxrmate -c config/train/longitudinal_gt_prompt_tf.yaml --stages_module tools.stages --train
Seed set to 0
PTL no. devices: 1.
PTL no. nodes: 1.
/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/fabric/connector.py:571: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py:512: You passed `Trainer(accelerator='cpu', precision='16-mixed')` but AMP with fp16 is not supported on CPU. Using `precision='bf16-mixed'` instead.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Description, Special token, Index
bos_token, [BOS], 1
eos_token, [EOS], 2
unk_token, [UNK], 0
sep_token, [SEP], 3
pad_token, [PAD], 4
cls_token, [BOS], 1
mask_token, [MASK], 5
additional_special_token, [NF], 6
additional_special_token, [NI], 7
additional_special_token, [PMT], 8
additional_special_token, [PMT-SEP], 9
additional_special_token, [NPF], 10
additional_special_token, [NPI], 11
VisionEncoderDecoderModel has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
trainable params: 147,456 || all params: 80,916,528 || trainable%: 0.1822
/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
  warnings.warn(
/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/dlhpcstarter/utils.py:423: UserWarning: The "last" checkpoint does not exist, starting training from epoch 0.
  warnings.warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/bin/dlhpcstarter", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:              ^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/dlhpcstarter/__main__.py", line 126, in main
[rank0]:     submit(args, cmd_line_args, stages_fnc)
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/dlhpcstarter/__main__.py", line 21, in submit
[rank0]:     stages_fnc(args)
[rank0]:   File "/home/oscarloch/cxr_models/cxrmate/tools/stages.py", line 89, in stages
[rank0]:     trainer.fit(model, ckpt_path=ckpt_path)
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 941, in _run
[rank0]:     self._data_connector.prepare_data()
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 100, in prepare_data
[rank0]:     call._call_lightning_module_hook(trainer, "prepare_data")
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/cxr_models/cxrmate/modules/lightning_modules/single.py", line 304, in prepare_data
[rank0]:     splits = pd.read_csv(splits_path)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
[rank0]:     return _read(filepath_or_buffer, kwds)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 626, in _read
[rank0]:     return parser.read(nrows)
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1968, in read
[rank0]:     df = DataFrame(
[rank0]:          ^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/core/frame.py", line 778, in __init__
[rank0]:     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/core/internals/construction.py", line 443, in dict_to_mgr
[rank0]:     arrays = Series(data, index=columns, dtype=object)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/core/series.py", line 490, in __init__
[rank0]:     index = ensure_index(index)
[rank0]:             ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 7647, in ensure_index
[rank0]:     return Index(index_like, copy=copy, tupleize_cols=False)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 565, in __new__
[rank0]:     arr = sanitize_array(data, None, dtype=dtype, copy=copy)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/core/construction.py", line 654, in sanitize_array
[rank0]:     subarr = maybe_convert_platform(data)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oscarloch/miniconda3/envs/cxrmate_env/lib/python3.12/site-packages/pandas/core/dtypes/cast.py", line 138, in maybe_convert_platform
[rank0]:     arr = lib.maybe_convert_objects(arr)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "lib.pyx", line 2538, in pandas._libs.lib.maybe_convert_objects
[rank0]: TypeError: Cannot convert numpy.ndarray to numpy.ndarray

Thank you so much for your help!

anicolson commented 1 month ago

Hi @oscarloch,

So the error is originating from here:

[rank0]:   File "/home/oscarloch/cxr_models/cxrmate/modules/lightning_modules/single.py", line 304, in prepare_data
[rank0]:     splits = pd.read_csv(splits_path)

i.e.: https://github.com/aehrc/cxrmate/blob/b106927021e7037e4198bdc1dd36524c227303c8/modules/lightning_modules/single.py#L304

Can you put a breakpoint there and see what is going on? That line simply loads a CSV file. Can you check that the path is right and that the file is not corrupted?

Also, try reading the .csv file with pandas outside of cxrmate.
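
A minimal sketch of that standalone check, in case it helps. The function name `check_csv` and the report keys are made up for illustration and are not part of cxrmate. It first reads the header row with the standard library only, then tries `pd.read_csv`, and records the pandas and NumPy versions. Recording the versions is only a guess at something worth ruling out (a mismatched pandas/NumPy install); nothing in this thread confirms that as the cause.

```python
import csv
from pathlib import Path


def check_csv(path):
    """Return a small report on whether `path` is a readable CSV file.

    Hypothetical helper for debugging; not part of cxrmate.
    """
    p = Path(path)
    report = {
        "exists": p.is_file(),
        "size_bytes": None,
        "header": None,   # first row, read without pandas
        "pandas": None,   # installed pandas version
        "numpy": None,    # installed NumPy version
        "rows": None,     # number of data rows pandas parsed
        "error": None,    # first exception hit, if any
    }
    if not report["exists"]:
        return report
    report["size_bytes"] = p.stat().st_size
    try:
        # Plain-Python read of the first row; does not touch pandas or NumPy.
        with p.open(newline="") as f:
            report["header"] = next(csv.reader(f), None)

        # Imported here so that an import-time failure is reported as well.
        import numpy as np
        import pandas as pd

        report["numpy"] = np.__version__
        report["pandas"] = pd.__version__
        report["rows"] = len(pd.read_csv(p))
    except Exception as exc:
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report
```

Run it in the same `cxrmate_env` environment and pass it the value of `splits_path` seen at the breakpoint, e.g. `print(check_csv(splits_path))`. If `header` is filled in but `error` shows the same `TypeError`, the file is probably fine and the problem is more likely in the environment than in the CSV.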