ChenWWWeixiang / diagnosis_covid19

OpenCovidDetector is an open-source COVID-19 diagnosis system implemented in PyTorch, as presented in our paper: Development and evaluation of an artificial intelligence system for COVID-19 diagnosis. Nat Commun 11, 5088 (2020). (https://doi.org/10.1038/s41467-020-18685-1)
MIT License

Why not use all slices in CT scans for training instead of one every 10 slices? #1

Closed hnguyentt closed 4 years ago

hnguyentt commented 4 years ago

In the file data/get_train_jpgs.py, why did you use only one of every 10 slices in each CT scan for training, rather than all slices?

    for idx, i in enumerate(range(0,V.shape[0],10)):
        if not 'healthy' in set_name and False:
            if not i in sums2:
                continue
        data=V[i,:,:]
        data[data>500]=500
        data[data<-1200]=-1200#-1200~500
        data=data*255.0/1700
        data=data-data.min()

        data=np.stack([data,M[i,:,:]*data,M[i,:,:]*255],-1)#mask one channel
        data = data.astype(np.uint8)

        cv2.imwrite(os.path.join(output_path_slices,'nor_'+set_name+'_'
                                 +name.split(',')[0].split('/')[-1].split('.nii')[0]
                                 +'_'+str(int(i/(V.shape[0])*100))+'.jpg'),data)

ChenWWWeixiang commented 4 years ago

Thanks for reading my code in such detail. The sampling stride differed between types of data in our experiments, in order to keep the number of samples per class within the same order of magnitude. As you can see in the paper, there was very little influenza data, while we had many COVID-19 and CAP cases. The second reason is practical: using all slices for training costs too much time and changes the results only slightly, since the slices we sampled and used were already enough to fit, or almost overfit, the network.
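
The class-balancing idea above can be illustrated with a minimal sketch. It assumes a stride is chosen per class so that each class contributes roughly the same number of slices. The helper names (`strides_for_balance`, `sampled_indices`) and the slice counts are hypothetical, not taken from the repository or the paper.

```python
import math


def strides_for_balance(slices_per_class, target_per_class):
    """Pick one sampling stride per class so that each class contributes
    at most about `target_per_class` slices. The stride is never below 1,
    so small classes simply keep every slice."""
    return {
        name: max(1, math.ceil(total / target_per_class))
        for name, total in slices_per_class.items()
    }


def sampled_indices(num_slices, stride):
    """Slice indices kept from one scan, mirroring range(0, V.shape[0], stride)."""
    return list(range(0, num_slices, stride))


# Hypothetical slice totals, for illustration only (not the paper's numbers).
counts = {"covid": 50000, "cap": 40000, "influenza": 4000}
strides = strides_for_balance(counts, target_per_class=5000)

for name, stride in strides.items():
    print(name, "stride:", stride, "approx. slices kept:", math.ceil(counts[name] / stride))
```

With these made-up counts, the large classes get a coarse stride while the small influenza class keeps every slice, which is one way to keep the per-class totals in the same order of magnitude.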

hnguyentt commented 4 years ago

Do you think that applying stride = 10 might skip slices in the scans that are important for diagnosis?

ChenWWWeixiang commented 4 years ago

Sorry for not replying immediately! Different types of data were sampled with different strides, and so were data from different centers. As a result, I can hardly recall why I used 10, but the question is well worth considering. Since we collected so much data, the influence is not critical; the whole training process was similar to multi-instance learning with weak supervision.
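
One way to read "multi-instance learning with weak supervision" is sketched below. It assumes every sampled slice inherits its scan's label during training, and that a scan-level score is obtained by averaging the top-k slice probabilities. Both the aggregation rule and the function names are assumptions made for illustration; the thread does not say how the repository actually aggregates slices.

```python
def weak_slice_labels(scan_label, num_sampled_slices):
    """Weak supervision: each sampled slice is given the scan-level label,
    even though some individual slices may show no lesion."""
    return [scan_label] * num_sampled_slices


def scan_score(slice_probs, k=3):
    """Aggregate per-slice probabilities into one scan-level score by
    averaging the k highest values (top-k mean pooling, an assumed choice)."""
    if not slice_probs:
        raise ValueError("slice_probs must not be empty")
    top = sorted(slice_probs, reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical per-slice probabilities for a single scan.
probs = [0.1, 0.9, 0.8, 0.2, 0.7]
print(weak_slice_labels(1, len(probs)))
print(scan_score(probs, k=3))
```

Under this reading, a skipped slice matters less when lesions span several neighbouring slices, because the scan-level score depends on a few high-scoring slices rather than on every slice.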

hnguyentt commented 4 years ago

Thank you for your answer.