dhammack / DSB2017

Code for 2nd place solution to the 2017 National Data Science Bowl

Generation of annotations_enhanced.csv #3

Closed mkmohangb closed 7 years ago

mkmohangb commented 7 years ago

Hi Daniel,

I have a question about how annotations_enhanced.csv was generated. The first few columns are fairly straightforward (they come from the LUNA16 annotations.csv and the mhd file itself). But how were the 'margin', 'lobulation', 'spiculation', and 'malignancy' values generated? In LIDC xml, these features have integral values from 1 - 5 but in annotations_enhanced.csv, these features have fractional parts. Can you please explain?

seriesuid: 1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793540579077826395208
margin: 3.66667
lobulation: 1.33333
spiculation: 1.33333
malignancy: 2.66667

Thanks.

dhammack commented 7 years ago

> In LIDC xml, these features have integral values from 1 - 5 but in annotations_enhanced.csv, these features have fractional parts. Can you please explain?

Yes! Some nodules have multiple radiologist annotations in the LIDC xml data. When this occurred, I took the median. That should explain the fractional parts.

Let me know if you have any other questions.

mkmohangb commented 7 years ago

Thanks for the clarification. I assume you wanted to say mean and not median. I verified this for a couple of patients by taking the average.

Also, the LUNA16 annotations file has 1186 entries, but the enhanced one has only 1172. Why were 14 entries left out?

dhammack commented 7 years ago

Yes, I remembered incorrectly. It was mean and not median.
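To illustrate why the mean explains the fractional values, here is a minimal sketch. The per-radiologist ratings are hypothetical: they were chosen only so that their means reproduce the sample row quoted earlier in this thread, and they are not taken from the LIDC XML. The `aggregate` helper is likewise an illustrative name, not a function from this repository.

```python
from statistics import mean, median

# Hypothetical ratings (1-5) from three radiologists for a single nodule.
# These are assumptions chosen to reproduce the sample row, not real LIDC data.
ratings = {
    "margin": [3, 4, 4],
    "lobulation": [1, 1, 2],
    "spiculation": [1, 1, 2],
    "malignancy": [2, 3, 3],
}


def aggregate(ratings_by_feature, reducer=mean):
    """Collapse each feature's list of ratings into one value, rounded to 5 places."""
    return {
        name: round(reducer(values), 5)
        for name, values in ratings_by_feature.items()
    }


means = aggregate(ratings)            # fractional values such as 3.66667
medians = aggregate(ratings, median)  # whole numbers for three integer ratings

for name in ratings:
    print(f"{name}: mean={means[name]} median={medians[name]}")
```

A median of integer ratings is either an integer or ends in .5, so values ending in .33333 or .66667 can only come from averaging.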

I had a hard time joining the LUNA and LIDC datasets. I believe the 14 dropped entries were nodules in the LUNA dataset that I couldn't map reliably to LIDC. I had to do the mapping in a very convoluted way, by parsing the XML files and matching nodule locations and sizes.

Hopefully LUNA or LIDC have released a more reliable mapping since I wrote this code. I know there was some talk about this on the LUNA website.
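The kind of location-and-size matching described above could look roughly like the following sketch. It is not the code used in this repository. The function name `match_nodules`, the row layout (dicts with `seriesuid`, `x`, `y`, `z`, `diameter`, all in mm), and both tolerance values are assumptions made for illustration. It pairs each LUNA nodule with the nearest LIDC nodule in the same scan and reports the ones that cannot be matched, which is how entries might end up dropped.

```python
import math


def match_nodules(luna_rows, lidc_rows, max_dist_mm=5.0, max_diam_diff_mm=3.0):
    """Pair each LUNA nodule with the nearest LIDC nodule from the same scan.

    Both tolerances are hypothetical. Returns (matches, unmatched), where
    matches is a list of (luna_index, lidc_index) pairs and unmatched is a
    list of LUNA indices with no acceptable LIDC candidate.
    """
    matches, unmatched = [], []
    for i, luna in enumerate(luna_rows):
        best_j, best_dist = None, max_dist_mm
        for j, lidc in enumerate(lidc_rows):
            if lidc["seriesuid"] != luna["seriesuid"]:
                continue  # only compare nodules within the same scan
            dist = math.dist(
                (luna["x"], luna["y"], luna["z"]),
                (lidc["x"], lidc["y"], lidc["z"]),
            )
            size_ok = abs(luna["diameter"] - lidc["diameter"]) <= max_diam_diff_mm
            if dist <= best_dist and size_ok:
                best_j, best_dist = j, dist
        if best_j is None:
            unmatched.append(i)
        else:
            matches.append((i, best_j))
    return matches, unmatched


# Made-up coordinates, purely to exercise the function.
luna = [
    {"seriesuid": "a", "x": 0.0, "y": 0.0, "z": 0.0, "diameter": 6.0},
    {"seriesuid": "a", "x": 50.0, "y": 50.0, "z": 50.0, "diameter": 4.0},
]
lidc = [
    {"seriesuid": "a", "x": 1.0, "y": 1.0, "z": 1.0, "diameter": 5.5},
    {"seriesuid": "b", "x": 50.0, "y": 50.0, "z": 50.0, "diameter": 4.0},
]
matches, unmatched = match_nodules(luna, lidc)
print(matches, unmatched)
```

With tolerances like these, a LUNA nodule whose nearest LIDC candidate is too far away or too different in size stays unmatched rather than being paired unreliably.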

mkmohangb commented 7 years ago

Thanks for the explanation. I used Julian's LIDC xml parsing script to verify a couple of cases in annotations_enhanced.csv.