Closed wangdada-love closed 9 months ago
Hi @wangdada-love,
Thank you for reaching out and sharing your code. It is very hard to spot the bug just by reviewing the code. What I can recommend is to look at:
pad_last_batch=True
which duplicates the last sample in the last batch, overrepresenting it. While it may make sense with other frameworks, where the iterator can trim the padded samples and not present them to the network, this is currently not possible with TensorFlow, so you can consider turning it off (at least for training).
Maybe the above will help you narrow down the problem.
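To make the overrepresentation concrete, here is a minimal pure-Python sketch of the effect. It does not call DALI; the contiguous sharding rule and the helper names `shard_with_padding` and `duplicate_counts` are assumptions made for illustration, and the real reader may split and pad differently.

```python
from collections import Counter


def shard_with_padding(num_samples, num_shards, batch_size):
    """Hypothetical model of padded sharding (assumes num_samples >= num_shards)."""
    # Assumed sharding rule: contiguous, near-equal slices of the dataset.
    shards = [
        list(range(num_samples * i // num_shards,
                   num_samples * (i + 1) // num_shards))
        for i in range(num_shards)
    ]
    # Every shard must yield the same number of full batches.
    longest = max(len(s) for s in shards)
    padded_len = -(-longest // batch_size) * batch_size  # round up to a batch multiple
    # Pad each shard by repeating its own last sample.
    return [s + [s[-1]] * (padded_len - len(s)) for s in shards]


def duplicate_counts(shards):
    """Return {sample_id: times_seen} for samples seen more than once per epoch."""
    counts = Counter(sample for shard in shards for sample in shard)
    return {sample: n for sample, n in counts.items() if n > 1}


shards = shard_with_padding(num_samples=10, num_shards=4, batch_size=2)
for gpu, shard in enumerate(shards):
    print(f"GPU {gpu}: {shard}")
print("seen more than once per epoch:", duplicate_counts(shards))
```

With 10 samples, 4 shards, and a batch size of 2, every shard in this model is padded to 4 entries, so the network sees 16 samples per epoch, 6 of which are repeats of each shard's final sample. On a small or unevenly divisible dataset this kind of repetition can be a noticeable fraction of each epoch.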
Thanks for your response. I extracted some of the data for an experiment and found that the curve is normal, so the problem is most likely in my data.
Describe the question.
Question description
I followed the official documentation and examples to write a class that loads data through DALI, and I use this class to load data for my model training. During training, I found that both the inference loss and the inference IoU oscillate, and that the oscillation is periodic. Across multiple experiments, the oscillation period matched the number of GPUs I used. I suspect this is related to how the data is distributed under the distributed strategy, but I cannot pinpoint the specific problem. The following is the code for my data loading class. Could you please check whether there are any issues and tell me how to resolve them?
code
dataload class
train
Check for duplicates