MLI-lab / DeepDeWedge

Self-supervised deep learning for denoising and missing wedge reconstruction of cryo-ET tomograms

Sporadic error during training #8

Open · Phaips opened this issue 1 month ago

Phaips commented 1 month ago

Hi, I often get this error during training:

"/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_py
thon3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpret
er.py", line 192, in _read_header
    raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header

which is then followed by this error:

FileNotFoundError: [Errno 2] No such file or directory: 'home/recons/ddw/8/logs/version_0/checkpoints/epoch/epoch=999.ckpt'

The jobs can then continue training by setting resume_from_checkpoint: ".." to the last val_loss checkpoint. The weird thing is that not all jobs fail like this; it happens only sometimes, sporadically, either very early in the process, later, or never. After re-running once (or multiple times), they eventually finish (and produce amazing tomograms! :D)

Thanks for the help :) Cheers, Philippe

rdrighetto commented 1 month ago

Just wanted to note that the second error happens just because we have the prediction command (ddw refine-tomogram) in our script, so if training fails the checkpoint expected for prediction will not be there and it will fail too. The real error is the first one and indeed it seems to happen at random πŸ™

SimWdm commented 1 month ago

Hello Philippe and Ricardo,

thanks for opening this issue!

The error you describe sounds like something is wrong with reading and/or writing the model input and target subtomos as mrc files. Unfortunately, I have no idea what that might be, and the fact that it happens at random makes it hard to debug, but I will try my best!

Is the error message you shared the complete one? If not, could you please share the full message? That might help me find a place to start.

Best, Simon

Phaips commented 1 month ago

Hi Simon,

Thanks for looking into it nevertheless! The entire error is this:

ValueError: Caught ValueError in DataLoader worker process 9.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 12, in load_mrc_data
    with mrcfile.open(mrc_file, permissive=True) as mrc:
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
    return NewMrc(name, mode=mode, permissive=permissive,
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 115, in __init__
    self._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 131, in _read
    super(MrcFile, self)._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 170, in _read
    self._read_header()
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 192, in _read_header
    raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header

and actually I just saw another one, which seems to originate from the same problem:

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 14, in load_mrc_data
    data = torch.tensor(mrc.data)
RuntimeError: Could not infer dtype of NoneType

Cheers, Philippe

SimWdm commented 1 month ago

Hi Philippe,

thank you for sharing the full error messages. The problem is indeed related to reading and writing the headers of the subtomogram mrc files. For this, DDW uses the mrcfile package, and from some research online it seems that other people have already run into similar-looking issues with mrcfile and headers. I will look further into this today and tomorrow and will let you know as soon as I have made any progress.

In the meantime, if you are still using DDW and encounter another mrc-header-related error, it would be great if you could find out which subtomogram is causing the error and send me the corrupted file. This would be helpful for debugging, since I don't know what exactly is causing the header error.

To find the corrupted subtomogram after the error has occurred, you could, for example, try to open all subtomograms in your DDW fitting data directory (e.g. PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos/) in Python using the mrcfile package; at least one of them should give you a header error. Let me know if you need help with this.
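
A minimal sketch of such a check could look like this (the directory path is just a placeholder, and catching ValueError plus checking for None data is my guess at the two failure modes you saw; adjust as needed):

    import glob
    import os

    import mrcfile

    # Placeholder path; point this at your DDW fitting data directory.
    subtomo_dir = "PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos"

    for path in sorted(glob.glob(os.path.join(subtomo_dir, "*.mrc"))):
        try:
            with mrcfile.open(path, permissive=True) as mrc:
                if mrc.data is None:
                    # permissive mode can return no data when the header is broken
                    print(f"Suspicious file (no readable data): {path}")
        except ValueError as err:
            print(f"Failed to read {path}: {err}")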

Best, Simon

SimWdm commented 1 month ago

Hi Philippe and Ricardo,

just a quick heads up: After thinking about this issue for a while, I think that the cleanest solution would be to save the subtomograms as .pt files using PyTorch, which gets rid of .mrc headers altogether. I have already implemented this new strategy in a separate branch and am currently running a sanity check model fitting to make sure that the changes have not broken anything. Once the check is successful, you can test the changes on your data and if your problem is solved, I will merge the changes into main.
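
Just to illustrate the general idea (a simplified sketch, not the actual DDW code, and the file name is made up): each subtomogram tensor is written with torch.save and read back with torch.load, so no MRC header is ever created or parsed:

    import torch

    subtomo = torch.randn(96, 96, 96)  # example subtomogram as a 3D tensor

    # Saving: no MRC header to compute or parse.
    torch.save(subtomo, "subtomo_0.pt")

    # Loading in the dataset's __getitem__ then boils down to:
    subtomo0 = torch.load("subtomo_0.pt")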

Best, Simon

Phaips commented 1 month ago

Dear Simon,

Thanks a lot! The .pt approach seems like a good fix! I am happy to test it as soon as it passes the checks. I tried to read all the fitting_subtomos as well as the val_subtomos from a job which had encountered the error above. However, the error did not pop up and I was able to read all the .mrc files in Python without any problem...

Kind regards, Philippe

SimWdm commented 1 month ago

Dear Philippe,

I have just pushed the version with the .pt subtomos to a new branch called torch_subtomos. You can test this version by switching to the torch_subtomos branch and reinstalling DDW (re-running pip install . inside the DeepDeWedge/ directory should be sufficient). If your installation was successful, all subtomos should be saved with a .pt extension.

If this fix has resolved your issue, I will merge the changes into the main branch.

Please let me know if you have any questions or problems!

Best, Simon

rdrighetto commented 1 month ago

Thanks a lot Simon! We will test it.

If I may give some extra feedback on this: as far as we can tell, the issue happens at random, which I guess is related to the very intensive I/O when dealing with many subtomogram files. I'm not sure if just changing the file format will prevent this from happening again. My suspicion is that some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.

Another possibility is related to our filesystem: the script we normally use to run ddw copies the entire "project" to a local SSD on the compute node to optimize performance, and there could be some random failure in this process. Now I'm curious to see whether the problem also happens when not using the SSD but only the cluster's parallel filesystem; the drawback is that this will certainly be slower. Anyway, happy to test your solution!

SimWdm commented 1 month ago

My suspicion is some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.

That's an interesting point! If we indeed get random NaN or None values, I would probably have to implement some kind of checking routine that tries to load sub-tomograms and re-saves them if loading fails or something in that direction.
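
Something like this rough sketch is what I have in mind (the function name and retry policy are made up for illustration; this is not in DDW yet):

    import time

    import torch


    def load_subtomo_with_retry(path, retries=3, delay=1.0):
        """Try to load a .pt subtomogram, retrying on transient I/O failures."""
        for attempt in range(retries):
            try:
                return torch.load(path)
            except (RuntimeError, EOFError) as err:
                if attempt == retries - 1:
                    raise
                print(f"Loading {path} failed ({err}); retrying...")
                time.sleep(delay)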

Anyway, happy to test your solution!

I am very curious about the outcome of your tests! Let me know if you have any problems.

RHennellJames commented 3 weeks ago

I have also had this error with fit-model in the last week. My scripts overwrote the log file, but when I ran on the same tomogram several times, it happened at various epochs between 250 and 850. Most of the errors happened closer to epoch 850 than to 250, which I think suggests that it's a random event that has a chance of happening every epoch (and maybe becomes more likely as time goes on). I'm interested to hear if the new branch fixed it.

I previously ran without any training issues on a different dataset, but with refine-tomogram I kept getting an "Errno 28: no space left on device" error. I managed to fix that by setting TMPDIR (in our Slurm system) to be in my filespace on our GPFS storage system instead of on the node.

SimWdm commented 3 weeks ago

Thanks for sharing @RHennellJames! The fact that you get the same error suggests that the issue is not limited to a particular hardware setup, which is good to know.

@Phaips @rdrighetto did switching to the torch subtomos resolve the issue in your experiments?

rdrighetto commented 2 weeks ago

Hi @SimWdm!

We finally got to test the torch_subtomos branch. The good news is that it works. However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs). Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?

To summarize, the new code works, but it seems to be slower and we cannot really say whether it solves the issue, given the random nature of the problem. We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!

RHennellJames commented 2 weeks ago

Thanks for the update @rdrighetto. I will ask our Scientific Computing manager to install it here and see if it solves the problem for me as well.

Best, Rory

SimWdm commented 2 weeks ago

Thanks for testing the fix @rdrighetto!

However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs).

The longer runtime could be due to less efficient reading/writing of the torch subtomos. I have to test this in an isolated environment to be on the safe side, and I will get back to you once I know more!

Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?

That is another interesting observation! As the seed was the same in both runs, I would have expected identical results as well! Do training and validation curves look different as well? Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?

We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!

Have a good time at the lab retreat! I am sure we'll eventually sort out all these subtomo-related problems! πŸ™‚ πŸ™πŸΌ

RHennellJames commented 2 weeks ago

It seems that the torch branch also fixed the problem for me, at least on the dataset where I had the issue before.

Thanks very much for sorting this!

SimWdm commented 2 weeks ago

That's great news @RHennellJames! Was the model fitting slower for you as well?

RHennellJames commented 2 weeks ago

Hi @SimWdm, the per-epoch time looked to be the same. I'm not sure if the total time was longer, as it never got to completion for this dataset with the old code.

rdrighetto commented 2 weeks ago

Hi @SimWdm,

Do training and validation curves look different as well?

They look very similar but not identical:

[screenshots: training/validation loss curves for the vanilla run (test_vanilla) and the torch_subtomos run (test_torch_subtomos)]

Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?

Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?

I am carrying out this test right now, both for the original .mrc subtomos and for the torch_subtomos, and will report back soon 😉

SimWdm commented 2 weeks ago

Thanks for the update @rdrighetto!

Are there any random factors playing in a role in the fitting process that are outside the scope of the enforced random seed?

Initially, I thought that everything should be seeded, but your question made me check again, and I noticed that the random rotations of the sub-tomograms during model fitting are indeed not seeded (see ddw.utils.subtomo_dataset.SubtomoDataset).
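
For illustration, one way to make such per-subtomogram rotations deterministic (a sketch with made-up names, not the actual SubtomoDataset code) is to derive a dedicated generator from the global seed and the sample index:

    import torch


    def rotation_params_for_item(seed, index):
        """Draw a reproducible rotation for one dataset item.

        Seeding a per-item generator keeps the augmentation deterministic
        regardless of DataLoader worker scheduling.
        """
        gen = torch.Generator()
        gen.manual_seed(seed + index)
        # Example augmentation: a number of 90-degree rotations and an axis pair.
        k = int(torch.randint(0, 4, (1,), generator=gen).item())
        dims = [(0, 1), (0, 2), (1, 2)][int(torch.randint(0, 3, (1,), generator=gen).item())]
        return k, dims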

This should be easy to fix. For clarity, I will open another issue on reproducibility and link it to this one. 🤓

Thank you for paying close attention to your experiments, which helps uncover these nasty little flaws! 🙂


Edit: The new issue is here https://github.com/MLI-lab/DeepDeWedge/issues/9#issue-2511502473

rdrighetto commented 1 week ago

I ran each test (MRC vs. torch subtomos) identically one more time, and in this second run the MRC subtomos ran into the sporadic error while the torch_subtomos branch ran to completion. So now I'm more confident that the torch subtomos are more stable than the MRC files (though to be honest not yet 100% sure). Regarding performance, the torch subtomos now gave the best performance overall: 513min, so I don't think there's any fundamental bottleneck in comparison to the old code for MRC subtomos -- probably just random fluctuations of our hardware :-) Regarding reproducibility, I do have an interesting observation, but I will continue that discussion in #9.

Thanks again @SimWdm for the prompt responses here!

SimWdm commented 1 week ago

Let me summarise: As observed by @RHennellJames and @rdrighetto (thanks again to both of you!), the torch subtomo solution does not seem to cause any slowdowns and has not caused any crashes during data loading so far. So I think it should be safe to merge the subtomo branch into main.

Any objections? πŸ™‚

rdrighetto commented 1 week ago

Sounds good to me!

TomasPascoa commented 1 week ago

Hi, first of all, thanks so much for this discussion - it really helped me make sense of the errors I was getting!

Secondly, I wanted to share my experience: I tried both the MRC and Torch subtomo approaches, and both worked well in general. However, I found that the Torch subtomo solution succeeded for a tomogram where the MRC subtomo approach had previously failed due to the header error you described. I think I agree with @rdrighetto's assessment - it may be an I/O issue causing the MRC error.

That said, I ran into similar issues with a different dataset while using the .pt subtomos. The process ran for 779 epochs before failing, and I was able to restart from the last checkpoint and push it to completion, but it shows that a similar error is still possible with the Torch subtomo approach. Here's the relevant part of the error message, which to me feels like it may be related to I/O issues:

RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/ddw/utils/subtomo_dataset.py", line 60, in __getitem__
    subtomo0 = torch.load(subtomo0_file)
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1072, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 480, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Epoch 779: 100%|██████████| 116/116 [01:51<00:00,  1.04it/s, loss=30, v_num=0, fitting_loss=29.10, val_loss=29.00]
real    951m23.637s
user    8650m1.955s
sys     198m50.308s

TomasPascoa commented 1 week ago

Apologies @SimWdm - I just saw that you merged the .pt subtomos approach into master as I was writing this. I am still inclined to agree that it probably makes sense - but it's worth keeping in mind that the issue may not be completely gone :)

SimWdm commented 1 week ago

Thanks for sharing your experience @TomasPascoa! No reason to apologize: if the error persists (although in a different form), the issue should be kept open, so I have re-opened it.

Any help is highly appreciated! πŸ˜…

rdrighetto commented 1 week ago

@TomasPascoa confirms my fears have come true 😝