stekiri opened this issue 2 years ago
In your callback, each TPU core is overwriting the same files, `f"{self.output_dir}/{dataloader_idx}_{batch_idx}.pt"` and `predictions.pt`. So when you're loading them afterwards, you're seeing only a portion of the total (whatever was saved last). Either open the files in append mode, or write to a different file per core and group them together afterwards.
If you open the file in append mode, be sure to close it afterwards at the end of prediction.
If you're partitioning the files, you can use `trainer.global_rank` to distinguish each process's outputs.
Thanks @ananthsub for your very helpful guidance!
Your suggestion to write to 8 separate files works like a charm. For anyone coming across this issue in the future, here's how I modified the code:
```python
import os
from typing import Any, List

import torch
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import BasePredictionWriter

# MNISTDataModule and LitModel are defined in the TPU tutorial this snippet is based on.


class MultiFileWriter(BasePredictionWriter):
    def __init__(self, output_dir: str, write_interval: str):
        super().__init__(write_interval)
        self.output_dir = output_dir

    def write_on_batch_end(
            self, trainer, pl_module: LightningModule, prediction: Any, batch_indices: List[int], batch: Any,
            batch_idx: int, dataloader_idx: int):
        torch.save(prediction, os.path.join(
            self.output_dir,
            f"predictions-dataloader_{dataloader_idx}-batch_{batch_idx}-globalrank_{trainer.global_rank}.pt"))

    def write_on_epoch_end(
            self, trainer, pl_module: LightningModule, predictions: List[Any], batch_indices: List[Any]):
        torch.save(predictions, os.path.join(
            self.output_dir, f"predictions-globalrank_{trainer.global_rank}.pt"))


tmp_dir = "/tmp"
dm = MNISTDataModule()
model = LitModel(*dm.size(), dm.num_classes)
prediction_writer = MultiFileWriter(
    output_dir=tmp_dir,
    write_interval="epoch")
trainer = Trainer(
    tpu_cores=8,
    callbacks=[prediction_writer])
trainer.predict(model=model, datamodule=dm)
```
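After prediction, the 8 per-rank files still need to be stitched back together. A minimal sketch of that step — the helper names and the serializer-agnostic `load_fn` parameter are my own; with the writer above, `load_fn` would be `torch.load`, and each loaded element is that rank's full list of predictions, so you may want to flatten the result:

```python
import glob
import os
import pickle


def gather_predictions(output_dir, load_fn, pattern="predictions-globalrank_*.pt"):
    """Load every per-rank prediction file and return them in rank order.

    Lexicographic sort is fine for single-digit ranks (0-7 on an 8-core
    TPU); zero-pad the rank in the filename if you ever go past rank 9.
    """
    paths = sorted(glob.glob(os.path.join(output_dir, pattern)))
    return [load_fn(p) for p in paths]


def pickle_load(path):
    # stand-in for torch.load so the sketch stays dependency-free
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `gather_predictions(tmp_dir, torch.load)` after `trainer.predict(...)` would give one entry per rank.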
Unfortunately, I couldn't get your alternative suggestion, appending to a single file, to work. This is what I've tried:
```python
import os
from typing import Any, List

import torch
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import BasePredictionWriter

# MNISTDataModule and LitModel are defined in the TPU tutorial this snippet is based on.


class SingleFileWriter(BasePredictionWriter):
    def __init__(self, file_buffer, write_interval: str):
        super().__init__(write_interval)
        self.file_buffer = file_buffer

    def write_on_batch_end(
            self, trainer, pl_module: LightningModule, prediction: Any, batch_indices: List[int], batch: Any,
            batch_idx: int, dataloader_idx: int):
        torch.save(prediction, self.file_buffer)

    def write_on_epoch_end(
            self, trainer, pl_module: LightningModule, predictions: List[Any], batch_indices: List[Any]):
        torch.save(predictions, self.file_buffer)


tmp_dir = "/tmp"
dm = MNISTDataModule()
model = LitModel(*dm.size(), dm.num_classes)
with open(os.path.join(tmp_dir, 'sf_predictions.pt'), 'ab') as f:
    prediction_writer = SingleFileWriter(
        file_buffer=f,
        write_interval="epoch")
    trainer = Trainer(
        tpu_cores=8,
        callbacks=[prediction_writer])
    trainer.predict(model=model, datamodule=dm)
```
It seems that all the data is written, since the file has the expected size; however, when reading the file with `torch.load()`, only an eighth of the predictions actually end up in the loaded object. It looks like the written data is somehow colliding. Maybe you have another clever tip to make this work?
> It seems that all data is written as the file has the expected size, however, when reading the file using torch.load() only an eighth of the predictions are actually in the loaded object. Looks like the written data is somehow colliding. Maybe you have another clever tip to make this work?
Could you try writing directly to the file buffer? For instance, does this work?

```python
def write_on_batch_end(
        self, trainer, pl_module: LightningModule, prediction: Any, batch_indices: List[int], batch: Any,
        batch_idx: int, dataloader_idx: int):
    self.file_buffer.write(<some value>)

def write_on_epoch_end(
        self, trainer, pl_module: LightningModule, predictions: List[Any], batch_indices: List[Any]):
    self.file_buffer.write(<some dummy value>)
```
I get the same behavior. I write with `self.file_buffer.write(pickle.dumps(predictions))` and read it back with `pickle.load()`, since `torch.load()` fails with `RuntimeError: Invalid magic number; corrupt file?` when loading the written buffer.
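One reason only a fraction shows up: `pickle.load()` deserializes a single record and stops, even when further records were appended after it. A sketch of reading every appended record back (the helper name is mine; note this still cannot repair records whose bytes were interleaved mid-write by concurrent processes, which is the likely corruption here, so per-rank files remain the more robust option):

```python
import pickle


def load_appended_records(path):
    """Read every pickle record appended to one file.

    pickle.load reads exactly one record and leaves the file position
    just past it, so looping until EOFError recovers all of them.
    """
    records = []
    with open(path, "rb") as f:
        while True:
            try:
                records.append(pickle.load(f))
            except EOFError:
                break
    return records
```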
Hi, this discussion was very helpful for me!
But I still need to figure out how to save image file names along with their predictions. Also, I am working with a much larger dataset than MNIST, so I may not be able to fit all the predictions in the TPU core memory (8 GB) at once.
I would really appreciate any possible help! Thanks
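One possible direction, sketched under assumptions not in the thread (that each batch carries its image filenames, that per-batch output files are acceptable, and using `pickle` as a stand-in for `torch.save`; the function name is hypothetical): write each batch's predictions and filenames to their own per-rank, per-batch file, e.g. from `write_on_batch_end` with `write_interval="batch"`, so only a single batch ever has to sit in core memory:

```python
import os
import pickle


def write_batch(output_dir, rank, batch_idx, filenames, predictions):
    """Persist one batch's predictions together with the image filenames
    that produced them, in a file unique to (rank, batch_idx).

    Zero-padding batch_idx keeps lexicographic order equal to batch order
    when the files are globbed back later.
    """
    record = {"filenames": list(filenames), "predictions": predictions}
    path = os.path.join(output_dir, f"preds-rank{rank}-batch{batch_idx:05d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(record, f)
    return path
```

Inside a `BasePredictionWriter` subclass this would be called as roughly `write_batch(self.output_dir, trainer.global_rank, batch_idx, batch_filenames, prediction)`, where obtaining `batch_filenames` depends on how your dataset returns them.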
🐛 Bug
When writing predictions with `torch.save` together with a `BasePredictionWriter` (see this example) on Colab using a TPU runtime employing all 8 cores, only an eighth of the predictions are actually saved to disk.

To Reproduce
The following code is based on the TPU tutorial with a few modifications:
Package installation:
Code:
When using `tpu_cores=[1]`, all predictions are saved correctly, with the downside of only using one core instead of all eight.

Expected behavior
The predictions from all cores should be saved in the file.
Environment
Colab with TPU runtime.
Additional context
Using the `BasePredictionWriter` was suggested in this issue. As requested by @kaushikb11, I created this new issue.

cc @kaushikb11 @rohitgr7