dorianps / LINDA

Lesion Identification with Neighborhood Data Analysis
Apache License 2.0

Training script for LINDA #21

Open fgfmds opened 5 years ago

fgfmds commented 5 years ago

Dear Dorian,

In your FAQ's you mention that you have an example training script that you would make available upon request. I would like to re-train LINDA on some datasets that I have acquired. Can you please send me that training script? Will you be attaching it to a post or do you need to email it?

Thank you in advance!

dorianps commented 5 years ago

Sure, here it is.

PublishablePennModel.zip

This should be the file I used to create the model that currently comes with LINDA. It is simplistic and not well documented, which is why I did not publish it online.

If you are going to train a model, you may want to think in terms of what features (images) you will use, and prepare them all in template space. LINDA currently uses the six features/images produced by getLesionFeatures. It is usually a good idea to normalize all features to 0-1 values, and maybe also preprocess them by truncating intensities below the 1st and above the 99th percentile (typically outliers). Once you have your features/images, the mrvnrfs function can take care of taking your images and training/testing a model. You don't need to worry about how to extract voxel values into matrices and how to put the values back into images.
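For example, with ANTsR the truncation and 0-1 normalization can be done along these lines; this is just a sketch, and the file names are placeholders for your own feature images already warped to template space:

```r
# Sketch: clip intensity outliers and rescale one feature image to 0-1 with ANTsR.
# The file names are hypothetical placeholders.
library(ANTsR)

feat = antsImageRead("sub01_feat1_templatespace.nii.gz")

# drop values below the 1st and above the 99th percentile, then rescale to 0-1
feat = iMath(feat, "TruncateIntensity", 0.01, 0.99)
feat = iMath(feat, "Normalize")

antsImageWrite(feat, "sub01_feat1_prepped.nii.gz")
```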

I have tons of other files produced during the development of LINDA, but I don't want to confuse you with them. If you need any help, let me know. Some functions I used in the above script (e.g., mrvnrfs_chunks) were under development at the time and are not needed any more, because they might already be in ANTsR (mrvnrfs, randomMask, splitMask).

Dorian

fgfmds commented 5 years ago

Dorian,

Thank you for your quick reply and for sending the script. I read your post and looked over the training script as well as the getLesionFeatures script. I really appreciate your offer to help, because to be completely honest, I will need your help for sure!

It looks like the first step is for me to create my new feature images. I will focus on that for now. My understanding is that for each training T1 image I will be using, I will have to run getLesionFeatures and generate the 6 feature images that are required for LINDA training (5 images plus the T1 itself).

Looking at the arguments, I see that getLesionFeatures requires the following inputs:

img: T1 training scan
bmask: corresponding lesion mask
template: not clear to me what this is. Which template am I using?

It also seems that your getLesionFeatures script does the normalization and intensity truncation, at least for some of the feature images. Please correct me if I am wrong, but I am observing the following explicit operations:

feat1: normalized but not truncated

feat2: normalized but not truncated

feat3: normalized and truncated

feat4: not normalized, not truncated

feat5: normalized and truncated

feat6: not normalized, not truncated (I noticed that both operations are commented out for feat6 in the training script)

Are the feature images that are not truncated and/or normalized OK to leave as is? Should I modify the script to explicitly normalize and truncate all 6 feature images? Or are they perhaps normalized/truncated implicitly (via another function) or in some other script?

For the purposes of organizing the new feature images and getting them ready for training, it looks like the training script expects one folder per feature (all feat(i) images should be in one folder, i = 1 to 6). Am I reading this correctly?

Ultimately, I would like to replicate training LINDA successfully, including all necessary pre-processing steps. I am happy to document the feature generation and training processes and, of course, share that documentation with you so it can be made available to everyone.

Thank you!

dorianps commented 5 years ago

> Dorian,
>
> Thank you for your quick reply and for sending the script. I read your post and looked over the training script as well as the getLesionFeatures script. I really appreciate your offer to help, because to be completely honest, I will need your help for sure!
>
> It looks like the first step is for me to create my new feature images. I will focus on that for now. My understanding is that for each training T1 image I will be using, I will have to run getLesionFeatures and generate the 6 feature images that are required for LINDA training (5 images plus the T1 itself).

I thought you wanted to try new features, like DWI+T1, etc., but if you want to follow the exact same path as the current LINDA model, that is fine. I don't foresee any major benefit of a personalized model with the current features, though. What may be of interest is to focus on smaller lesions, in which case it might be a good idea to go up to 1mm resolution. The biggest failure of the current LINDA model is with tiny lesions. This is because it was trained on big lesions, and it expects the early low-resolution predictions to give a good initial guess, which doesn't happen with small lesions; they only gain importance later, in the higher-resolution models.

> Looking at the arguments, I see that getLesionFeatures requires the following inputs:
>
> img: T1 training scan
> bmask: corresponding lesion mask
> template: not clear to me what this is. Which template am I using?

Those features are based on differences between the patient images and the corresponding template features. The template features come with LINDA and are located here: https://github.com/dorianps/LINDA/tree/master/inst/extdata
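For example, something like this lists the template features from an installed copy of LINDA rather than hard-coding any file names (just a sketch):

```r
# Sketch: locate the template feature files shipped in LINDA's extdata folder
# and read one of them as an antsImage. No file names are hard-coded.
library(ANTsR)

extdir = system.file("extdata", package = "LINDA")
dir(extdir)   # inspect what is shipped with the package

niifiles = dir(extdir, pattern = "nii", full.names = TRUE)
template = antsImageRead(niifiles[1])   # e.g., the first NIfTI file found
```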

> It also seems that your getLesionFeatures script does the normalization and intensity truncation, at least for some of the feature images. Please correct me if I am wrong, but I am observing the following explicit operations:
>
> feat1: normalized but not truncated
>
> feat2: normalized but not truncated
>
> feat3: normalized and truncated
>
> feat4: not normalized, not truncated
>
> feat5: normalized and truncated
>
> feat6: not normalized, not truncated (I noticed that both operations are commented out for feat6 in the training script)
>
> Are the feature images that are not truncated and/or normalized OK to leave as is? Should I modify the script to explicitly normalize and truncate all 6 feature images? Or are they perhaps normalized/truncated implicitly (via another function) or in some other script?

Sounds right, but note that feat6 in LINDA, which is the T1 itself, is already normalized before going into getLesionFeatures; the T1 is the most important one to bias correct, truncate, and normalize. Some imaging features are math calculations that are already normalized. Also, my old choices are not set in stone, i.e., you can choose to truncate at 0.999 if 0.99 seems too aggressive. But yes, if you just want to build the same LINDA model with your data, pass everything through getLesionFeatures (after taking care of the T1) and you should have what you need.
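For example, preparing the raw T1 before it goes into getLesionFeatures could look roughly like this; paths are placeholders and the truncation quantiles are up to you, as noted above:

```r
# Sketch: bias correct, truncate, and normalize a raw T1 before getLesionFeatures.
# The file names are hypothetical placeholders; adjust quantiles to taste.
library(ANTsR)

t1 = antsImageRead("sub01_T1.nii.gz")
t1 = n4BiasFieldCorrection(t1)                    # N4 bias field correction
t1 = iMath(t1, "TruncateIntensity", 0.01, 0.99)   # drop intensity outliers
t1 = iMath(t1, "Normalize")                       # rescale to 0-1
antsImageWrite(t1, "sub01_T1_prepped.nii.gz")
```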

> For the purposes of organizing the new feature images and getting them ready for training, it looks like the training script expects one folder per feature (all feat(i) images should be in one folder, i = 1 to 6). Am I reading this correctly?

That was an ad-hoc choice; I used to like having all features in separate folders. You can save them any way you like, as long as at the end you create the list structure that mrvnrfs expects: a list of subjects, each containing a list of features/antsImages.
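For concreteness, here is a minimal sketch of building that nested list from a one-folder-per-feature layout; the subject IDs, folder names, and paths are hypothetical, and the training call is commented out because the exact mrvnrfs arguments depend on your ANTsR version (check args(mrvnrfs) first):

```r
# Sketch: assemble the list-of-subjects / list-of-features structure for mrvnrfs.
# All subject IDs, folder names, and file names are hypothetical placeholders.
library(ANTsR)

subjects = c("sub01", "sub02", "sub03")   # hypothetical subject IDs
featdirs = paste0("feat", 1:6)            # one folder per feature, as in the training script

# x: outer list = subjects, inner list = that subject's six feature antsImages
x = lapply(subjects, function(s) {
  lapply(featdirs, function(d) antsImageRead(file.path(d, paste0(s, ".nii.gz"))))
})

# y: lesion label images; masks: brain masks (one per subject, hypothetical paths)
y     = lapply(subjects, function(s) antsImageRead(file.path("lesions",   paste0(s, ".nii.gz"))))
masks = lapply(subjects, function(s) antsImageRead(file.path("brainmask", paste0(s, ".nii.gz"))))

# Training call, commented out: confirm the argument names with args(mrvnrfs)
# on your installed ANTsR before uncommenting.
# model = mrvnrfs(y, x, labelmasks = masks, rad = rep(1, 3), ntrees = 500, nsamples = 200)
```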

> Ultimately, I would like to replicate training LINDA successfully, including all necessary pre-processing steps. I am happy to document the feature generation and training processes and, of course, share that documentation with you so it can be made available to everyone.
>
> Thank you!

Sure, that would be helpful to others, too. As I mentioned above, the most interesting outcome would be a model with a different type of data. I have some DWI+T1 acute data that could be used for that purpose, if it interests you or someone else (I have no time to build models myself currently).

fgfmds commented 5 years ago

Dorian,

Thanks for the comprehensive and detailed response. Let me clarify a few things:

  1. I am a total novice in this area of research, and have a lot of learning to do!

  2. Of course, I would love to try new features (DWI+T1, etc.), but my approach is to learn how to walk first, then start jogging, and maybe eventually running.

  3. Although it will be more time consuming and require more effort, I believe the right course for me is to first replicate the same LINDA model implementation, and once I have a good understanding of it, I can then explore new paths and other possibilities.

  4. Frankly, a lot of the information you're providing is new to me, and I will need some time to digest it and understand it all, which is why I feel it would be wise to replicate your current work first. It will serve as good practice.

  5. Yes, I am fully aware of the challenges with small lesions, and I intend to tackle them at some point. They are definitely a lot more difficult to locate and predict.

  6. I would love to take a look at your acute data, and any pointers you may have on how to work with it and train on it will be greatly appreciated.

Thank you!

dorianps commented 5 years ago

Ok, we can talk about the acute data once you are familiar with these models. Good luck for now.

fgfmds commented 5 years ago

That's exactly my plan. Thank you kindly!